Abstract.

This  work  discusses  the  representation  and  manipulation  of  sets  based 
on  two  different  concepts:  tries,  and  hashing  functions. 

The  sets  considered  here  are  assumed  to  be  static:  once  created,  there 
will  be  no  further  insertions  or  deletions.  For  both  trie-  and  hash-based 
strategies,  a series  of  representations  is  introduced  which  together  with 
the  availability  of  preprocessing  reduces  the  average  sizes  of  the  sets 
to  nearly  optimal  values,  yet  retains  the  inherently  good  retrieval 
characteristics. 

The  intersection  procedure  for  trie-based  representations  is  based 
on  the  traversal  in  parallel  of  the  tries  representing  the  sets  to  be 
intersected,  and  it  behaves  like  a series  of  binary  searches  when  the  sets 
to  be  intersected  are  of  very  different  sizes.  Hashed  intersection  runs 
very  fast.  The  average  time  is  proportional  to  the  size  of  the  smallest 
set  to  be  intersected  and  is  independent  of  the  number  of  sets  (except 
for  the  intersection  set  itself  which  has  to  be  checked  for  every  set). 
1.  Introduction.


Probably  nothing  would  be  closer  to  a "computer  sciences  panacea" 
than a general purpose, all-efficient representation for sets. Fortunately
the  problem  has  not  been  solved  yet  so  there  are  lots  of  topics  for  research 
(and  more  work  for  computer  scientists  ...). 

And  thus  this  thesis  deals  with  sets:  how  to  represent  them 
(discussing  two  different  representation  techniques)  and  how  to  perform 
some basic operations (intersection in particular) on them.

On  Sets  and  Their  Representation. 

Throughout  this  work  a set  P will  be  a subset  of  a universe  U . 
Usually that universe will be the set of nonnegative integers that can be
represented with s bits: $U_s = [0, 2^s)$. But as a limit case, useful
for obtaining asymptotic results, an infinite universe of real numbers will
also be considered: $U = [0,1)$.

Since  most  of  the  results  deal  with  average  cases  we  need  to  define 
our  random  set  P of  size  |P|  = n : 

(i) In the case of the finite universe $U_s = [0, 2^s)$ we will assume
all the sets of n elements to be equally probable:

$$\mathrm{Prob}[P = \{p_1, p_2, \ldots, p_n\}] = 1 \Big/ \binom{2^s}{n} \qquad (1.1)$$


(ii) For the infinite universe $U = [0,1)$, we will consider the
elements of P as independent, uniformly distributed real numbers
in U, so that for any interval $[\alpha, \beta) \subset U$, and for any $x \in P$,

$$\mathrm{Prob}[x \in [\alpha, \beta)] = \beta - \alpha \qquad (1.2)$$


One  of  the  characteristics  of  a representation  that  will  be 
investigated  is  the  size  of  the  average  set.  We  will  consistently 
discuss  theoretically  attainable  estimates  of  the  number  of  bits  needed 
to store the set. Given actual machine idiosyncrasies (addressing,
bit handling capabilities, etc.) whoever implements the algorithms
might  be  forced  to  store  bit  strings  within  byte  or  word  boundaries, 
thus  increasing  the  actual  sizes. 

The  sets  will  be  static,  in  the  sense  that  once  created  they  don't 
change,  or  they  change  so  infrequently  that  updates  may  be  processed 
as  creation  of  new  sets.  (This  is  the  case  of  many  big  data  bases 
organized by means of inverted files. Updates are usually kept in a
"news" list, and batch-processed periodically.) The importance of this
point  is  that  it  allows  for  a reasonable  amount  of  preprocessing  when 
creating  each  set,  making  possible  faster  operations  on  the  sets. 

On  Set  Intersection. 

Intersection  is  the  set  operation  we  will  investigate  in  most  detail, 
and  again  we  need  to  define  a statistical  model  in  order  to  obtain 
average  behaviors. 

But why not simply assume that the sets are random as defined above?
We cannot, because such an assumption would produce an intersection
that is small purely by chance, and that is not what happens in practice.
In general, Jane Q. User has some knowledge of the existence and size of
the intersection: for instance, in a library model (where sets are inverted
files of books with a certain attribute) it is very rare that a user will try to
retrieve books under both the keywords "Russian literature" and "Ostriches".

Thus  somehow  the  intersection  size  has  to  be  thrown  into  the  equation. 

Assume that we want to intersect the sets $P_1, P_2, \ldots, P_m$. Let
$\bar P = \langle |P_1|, |P_2|, \ldots, |P_m| \rangle$ be the sequence of sizes and
$Q = P_1 \cap P_2 \cap \cdots \cap P_m$, $|Q| = q$, the intersection to be computed.

We will define the average intersection time, $I(\bar P, q)$, according to the
following model:

• the intersection set, Q, is any arbitrary set of size q (with
the obvious restriction that $q \le \min\{|P_1|, |P_2|, \ldots, |P_m|\}$),

• the subsets $W_i = P_i - Q$ are random subsets of $U - Q$ in the sense
that for all $x \in U - Q$

$$\mathrm{Prob}[x \in W_i] = \frac{|W_i|}{|U - Q|}. \qquad (1.3)$$

Notice that given the above definition there is a chance that the $W_i$'s
do  actually  yield  a nonempty  intersection.  But  that  will  only  mean  that 
we  are  getting  a larger  intersection  for  the  price  of  a smaller  one. 
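The model above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `model_sets` is hypothetical, the universe is taken to be `range(universe_size)`, and Q is sampled at random purely for convenience (the model allows any arbitrary Q).

```python
import random

def model_sets(universe_size, sizes, q, rng):
    """Build m sets following the intersection model.

    Q is a q-subset of the universe; each W_i = P_i - Q is a uniform
    random subset of U - Q with sizes[i] - q elements, so every x in
    U - Q lands in W_i with probability |W_i| / |U - Q|.
    """
    if q > min(sizes):
        raise ValueError("q cannot exceed the smallest set size")
    universe = range(universe_size)
    Q = set(rng.sample(universe, q))
    rest = [x for x in universe if x not in Q]
    sets = [Q | set(rng.sample(rest, n - q)) for n in sizes]
    return Q, sets

rng = random.Random(1)
Q, sets = model_sets(256, [9, 20, 40], 3, rng)
common = set.intersection(*sets)   # contains Q, possibly a few extra elements
```

The final intersection may be strictly larger than Q, which is exactly the "larger intersection for the price of a smaller one" case mentioned in the text.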

Given the above model, the average intersection time can be divided
into

• $V(\bar P, q)$, the time needed to verify that Q is a true subset of
all $P_i$'s, plus

• $t(\bar P, q)$, the time needed (or "wasted") to verify that the intersection
of the $W_i$'s is empty.

Thus

$$I(\bar P, q) = V(\bar P, q) + t(\bar P, q). \qquad (1.4)$$

The above discussion does not imply that intersections will be computed
in two distinct steps; on the contrary, both algorithms introduced in
this work proceed without any knowledge of what Q is and what belongs
in the $W_i$'s. But Equation (1.4) underlies the methodology used to obtain
bounds for the intersection times.

On  the  other  hand,  V(P,  q)  may  be  seen  as  a measure  of  the  "productive 
time":  a user  will  usually  be  glad  to  know  that  s/he  has  to  pay  V(P,  q)/q 
per  element  found  (in  the  intersection).  Similarly  t(P,  q)  is  a measure 
of  the  "risk"  involved  in  computing  the  intersection  and  finding  nothing. 

On  Set  Union. 

In certain representations, computing the union of two sets
requires essentially the same time as computing their intersection.
Though this work will not present any detailed account of algorithms
to compute unions, a short discussion, mainly for purposes
of comparison with current results, will be included in the last chapter.

A Gourmet  Guide  to  the  Contents. 

Two  representations  of  sets  are  discussed  in  this  thesis. 

The  first  one,  based  on  tries,  is  covered  in  Chapters  2 and  3.  Chapter  2 
discusses  issues  about  trie  representation  (this  chapter  relies  heavily 
on  Appendix  A,  which  presents  some  needed  counting  results  about  tries) 
and  Chapter  3 discusses  the  trie  intersection  algorithm. 


The second representation, based on hashing techniques, is presented
in Chapter 4. Again the representation proper is discussed first
(Section 4.1) after which come the intersection methods (Section 4.2).
Finally, Chapter 5 presents the conclusions and practical considerations.

It  is  traditional  for  the  first  chapter  of  a thesis  to  state  how  much 
of  the  material  is  "new"  as  opposed  to  well-known.  In  this  case,  essentially 
all  of  the  algorithms  and  analyses  to  be  discussed  appear  to  be  new,  except 
for  some  auxiliary  formulas  that  are  quoted  from  the  literature.  The  basic 
ideas of hashing and trie representation are, of course, well known, and
the analyses of these basic methods are stated in order to demonstrate the
improvements being made.

Notation  and  Background. 

The notation used throughout this thesis is pretty standard (so it
will not be discussed further) with the exception of some sort of
"vector-like" notation required in Chapter 3 (where it is duly presented).

Both the concepts of tries and hash functions are excellently covered
by Knuth in [Knuth 73] (Section 6.3 covers tries and Section 6.4 hashing)
and the interested reader is referred to it for further details about the
above data structures.


2.  Trie  Representation  of  Sets. 

Knowing the bounds of the universe (and a total order relation over
its  elements)  is  the  key  to  representing  any  subset  of  such  a universe  as  a 
trie,  and  this  chapter  discusses  different  ways  of  doing  that. 

Throughout  this  chapter  we  will  be  concerned  with  representations 
and  their  space  requirements,  leaving  the  efficiency  of  intersection 
to Chapter 3.

When talking about space requirements, a measure of the minimum space
needed to represent a set is always handy. Let the universe be

$$U_s = [0, 2^s)$$

and M(s,n) the minimum average number of bits required to represent a
subset $S \subset U_s$ of $|S| = n$ elements.

Since there are $\binom{2^s}{n}$ subsets of size n it is obvious that

$$M(s,n) \ge \left\lceil \lg \binom{2^s}{n} \right\rceil \ge \lg \binom{2^s}{n}. \qquad (2.0.1)$$

And assuming relatively small subsets ($n < 2^s/s$ will suffice) we obtain

$$M(s,n) \ge n\left(s - \lg n + \frac{1}{\ln 2}\right) - \frac{1}{2} \lg n + O(1). \qquad (2.0.2)$$

But since $\lg n < s$ we can state that

$$M(s,n) \ge n\left(s - \lg n + \frac{1}{\ln 2}\right) - c\,s \qquad (2.0.3)$$

where c is some appropriate constant.
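As a quick numerical check of the bound, the sketch below compares the exact counting bound $\lg \binom{2^s}{n}$ with the leading term $n(s - \lg n + 1/\ln 2)$. The function names are illustrative and not part of the original text.

```python
from math import comb, log, log2

def exact_bound(s, n):
    """lg C(2^s, n): bits needed to tell apart all n-subsets of [0, 2^s)."""
    return log2(comb(2 ** s, n))

def approx_bound(s, n):
    """Leading term n(s - lg n + 1/ln 2) of the lower bound."""
    return n * (s - log2(n) + 1 / log(2))

print(round(approx_bound(8, 9), 1))   # 56.5
print(exact_bound(8, 9) < approx_bound(8, 9))   # True
```

For s = 8 and n = 9 the leading term gives about 56.5 bits; the exact bound is a few bits smaller because of the lower-order terms.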

Each different representation presented below will be associated with
a subscript and for each one some of the following average values will be
investigated:

    $t_i(s,n)$      size of tag bits per node,
    $b_i(s,n)$      size of a node in bits,
    $A_i(s,n)$      number of nodes,
    $P_i(s,n)$      total number of bits used to store pointers,
    $T_i(s,n)$      total number of bits used to store the set elements,
    and $B_i(s,n)$  total size in bits.

(The above quantities refer to "Representation i", and to subsets of n
elements from the universe $[0, 2^s)$.)

Also if R is a set, the quantities $t_i(s,R)$, $b_i(s,R)$, $P_i(s,R)$,
$T_i(s,R)$, $A_i(s,R)$ and $B_i(s,R)$ will denote the actual values for the
set R, not averages.

A certain  mythical  representation  that  asymptotically  realizes  the 
optimum M(s,n) will be our representation 0; thus

$$B_0(s,n) = n\left(s - \lg n + \frac{1}{\ln 2}\right). \qquad (2.0.4)$$

For  comparison  purposes,  the  good  old  sequential  allocation 
(with  each  cell  viewed  as  a node)  will  be  representation  1 , and 


    $t_1(s,n) = 0$        $P_1(s,n) = 0$
    $b_1(s,n) = s$        $T_1(s,n) = ns$
    $A_1(s,n) = n$        $B_1(s,n) = ns$.        (2.0.5)


The  following  example  will  be  used  throughout  this  chapter. 


Example 2.0.1. Let the universe be the set $U_8 = [0, 2^8)$; we will use
the subset

S = {3, 25, 50, 51, 130, 189, 200, 232, 249},

or in octal S = {003, 031, 062, 063, 202, 275, 310, 350, 371}. Notice
that |S| = 9 and the sequential representation of S requires
$B_1(8,S) = 8 \times 9 = 72$ bits. The optimum in this case is $B_0(8,9) = 56.5$.
2.1  Binary Tries.


Bisecting the universe sounds like the most "natural" kind of partition
and thus binary tries are born.

The obvious way, representation 2, uses the data structure depicted
in Figure 2.1.1. Under such representations each node requires space for
two tags (one bit each) and two fields large enough to hold either a value
or a pointer, hence

$$b_2(s,n) = 2(s+1). \qquad (2.1.1)$$

The average number of nodes presents us with some difficulties, since
there appears to be no reasonably simple expression for it. The analysis
in Appendix A (Section A.1) shows that $A_2(s,n)$ is given by a sum over
$0 \le j < s$ (Equation (2.1.2); the summand is illegible in this copy),
a really horrible expression.


Fortunately, we can bound $A_2(s,n)$, since in general
$A_2(s+1,n) \ge A_2(s,n)$ (see Appendix A, Lemma A.2.1), and by Lemma A.2.3

$$\lim_{s \to \infty} A_2(s,n) = A(n) \qquad (2.1.3)$$

where A(n) is the number of nodes in a trie representation of a random
set of n elements in U = [0,1).

In practice the difference between $A_2(s,n)$ and A(n) is very small,
since usually the sets to be represented are much smaller than the universe
(it is possible to prove that

$$A_2(s,n) = A(n)\,\bigl(1 - O(ns/(2^s - n))\bigr),$$


see  Appendix  A,  Section  A. 6,  and  thus  the  leading  term  of  A(n)  applies  to 


Ap(s,n)  when  n = 0(2S'  ) , the  usual  case  in  practical  applications). 


Figure 2.1.1. Representation 2.

[Trie diagram for the set S of Example 2.0.1, with leaf fields
003, 031, 062, 063, 202, 275, 310, 350 and 371.]

$A_2(8,S) = 13$ nodes
$b_2(8,S) = 2 + 16 = 18$ bits/node
$B_2(8,S) = 13 \times 18 = 234$ bits

Tag fields store values N (nonterminal subtrie) or T (terminal
subtrie). The other fields store either a pointer (including the
null one) or a set element. A pointer is used when there is more
than one element to be represented. The left subtrie at level k
represents the subset with bit k equal to 0, the right subtrie
represents bit k equal to 1, where bits are numbered 0 ... 7 from
left to right. Notice that the set elements appear in octal notation.

*****
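The node count quoted in Figure 2.1.1 can be checked with a short sketch. It assumes, as in the caption, that a node is created only for a subset with more than one element and that bit 0 is the leftmost bit; the function name `count_nodes` is illustrative.

```python
S = [3, 25, 50, 51, 130, 189, 200, 232, 249]

def count_nodes(elems, s, bit=0):
    """Number of nodes in the binary trie of distinct s-bit integers."""
    if len(elems) <= 1:
        return 0                      # empty or terminal subtrie: no node
    shift = s - 1 - bit               # bit 0 is the most significant bit
    left = [x for x in elems if not (x >> shift) & 1]
    right = [x for x in elems if (x >> shift) & 1]
    return 1 + count_nodes(left, s, bit + 1) + count_nodes(right, s, bit + 1)

nodes = count_nodes(S, 8)
print(nodes, nodes * 2 * (8 + 1))     # 13 234
```

Most of the nodes come from the long chain needed to separate 062 and 063, which differ only in their last bit.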


For the rest of the chapter we will adopt the estimates for tries
in the infinite universe U = [0,1) for the actual case $U_s = [0, 2^s)$
whenever the quantities for the infinite universe are upper bounds for
the corresponding ones in the finite case.

(The interested reader is referred to Section A.3 for further
discussion of asymptotic results for the infinite universe case.)

So we will adopt

$$A_2(s,n) = \frac{n}{\ln 2}, \qquad (2.1.4)$$

giving

$$B_2(s,n) = \frac{2n}{\ln 2}\,(s+1) = n(2.88s + 2.88). \qquad (2.1.5)$$

Space requirements for the above basic representation do not look very
good, and they certainly can be improved. A first (and easy) improvement,
representation 3, is to eliminate the space wasted in null pointers. (There
are $A_2(s,n) - n + 1$ null pointers requiring s bits each.) Figure 2.1.2
illustrates representation 3.

Both left and right tags now need to take three different values
(terminal, nonterminal, or null subtrie), hence

$$t_3(s,n) = 4 \text{ bits}. \qquad (2.1.6)$$

The number of nodes is still

$$A_3(s,n) = A_2(s,n) = \frac{n}{\ln 2}. \qquad (2.1.7)$$

And since $A_3(s,n) - 1$ pointers are needed

$$P_3(s,n) = (A_3(s,n) - 1) \cdot s \approx \frac{n}{\ln 2} \cdot s \qquad (2.1.8)$$

and

$$T_3(s,n) = n \cdot s. \qquad (2.1.9)$$

So

$$B_3(s,n) = 4\,\frac{n}{\ln 2} + n\left(\frac{1}{\ln 2} + 1\right) s = n(2.44s + 5.77). \qquad (2.1.10)$$

Figure 2.1.2. Representation 3.

[Trie diagram for the set S of Example 2.0.1 with null pointers removed.]

$A_3(8,S) = 13$ nodes
$B_3(8,S) = 13 \times 4 + 12 \times 8 + 9 \times 8 = 220$ bits

Tags notation
N = nonterminal subtrie
T = terminal subtrie
X = null subtrie.

Another way to improve on representation 2 is to store the nodes in
some fixed order (Preorder mit Variazionen), thus requiring at most one
pointer per node (since the left sibling of a node is always stored right
after the node itself). This is representation 4, and is depicted in
Figure 2.1.3.

Only six different tag combinations are possible, hence

$$t_4(s,n) = 3 \text{ bits}. \qquad (2.1.11)$$

The number of nodes is still the same

$$A_4(s,n) = A_2(s,n) = \frac{n}{\ln 2}. \qquad (2.1.12)$$

A pointer is needed only when both subtries are nonterminal. The analysis
in Section A.4 shows that about one of every four nodes has two nonterminal
siblings, so

$$P_4(s,n) = \frac{1}{4}\,\frac{n}{\ln 2} \cdot s. \qquad (2.1.13)$$

In order to store the set elements we need

$$T_4(s,n) = n \cdot s. \qquad (2.1.14)$$

Finally,

$$B_4(s,n) = 3\,\frac{n}{\ln 2} + \frac{1}{4}\,\frac{n}{\ln 2}\,s + n \cdot s = n(1.36s + 4.33). \qquad (2.1.15)$$

Figure 2.1.3. Representation 4.

Possible tag combinations are NN, NT, TN, TT, NX and XN.
Pointer/value field is used as a pointer when tag is NN, and as
a value if tag is TN or NT. When the tag is TT the two set
elements follow the node. The nodes are stored in preorder except
that a terminal right subtrie precedes its nonterminal left brother.
With variable length fields we can improve the storage requirements in two
ways:

(i) Using the information implicit in the trie structure allows us to
store only that suffix of each set element that has not been defined
by the path from the root to the parent node.

(ii) Defining pointers as displacements from the current node (and, of
course, storing nodes in, say, preorder) requires pointers of size
proportional to the logarithm of the number of elements in the
subtries.

And so we get to representation 5, similar in organization to
representation 4 but with variable length pointer and set element fields
(refer to Figure 2.1.4).

Tag size and number of nodes remain the same as before

$$t_5(s,n) = 3, \qquad A_5(s,n) = \frac{n}{\ln 2}. \qquad (2.1.16)$$

The space required to store the set elements also needs some thought:
we will estimate

$$T_5(s,n) = n \cdot s - n \cdot (\text{average depth}) \qquad (2.1.17)$$

where the average depth of a terminal element in the trie has already been
discussed by Knuth [Knuth 73, Section 6.3] for the case $s = \infty$. Actually
the quantity "n · (average depth)" above is the average total number of bit
inspections needed to look up the n keys in the set (that is $C_n$ as
defined by Equation 6.3(5) in [Knuth 73]), so we obtain

$$T_5(s,n) = n\left(s - \lg n - \frac{\gamma}{\ln 2} - \frac{1}{2}\right) \qquad (2.1.18)$$

where $\gamma$ is Euler's constant.
Figure 2.1.4. Representation 5.

[Trie diagram for the set S of Example 2.0.1; the elements 062 and 063
are implicit in their path and need no stored suffix.]

$A_5(8,S) = 13$ nodes
$B_5(8,S) = 13 \times 3 + 16$ (for pointers) $+ 31$ (for set elements) $= 86$ bits

Pointer sizes were assigned arbitrarily, for completeness' sake.
(They depend on the pointer encoding scheme used.)
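The 31 bits quoted for set elements in Figure 2.1.4 can be reproduced with a small sketch. It assumes that each element stores only the bits below the level of the node holding it, as in improvement (i); the function name `suffix_bits` is illustrative.

```python
S = [3, 25, 50, 51, 130, 189, 200, 232, 249]

def suffix_bits(elems, s, bit=0):
    """Total bits needed for the stored suffixes of a set of s-bit integers."""
    if len(elems) <= 1:
        return 0                      # a lone element is charged by its parent
    shift = s - 1 - bit               # bits bit+1 .. s-1 are still undefined
    halves = ([x for x in elems if not (x >> shift) & 1],
              [x for x in elems if (x >> shift) & 1])
    total = 0
    for half in halves:
        if len(half) == 1:
            total += shift            # terminal field: store the remaining bits
        else:
            total += suffix_bits(half, s, bit + 1)
    return total

print(suffix_bits(S, 8))              # 31
```

Together with 13 tags of 3 bits and the 16 pointer bits assumed in the figure, this gives the 86 bits reported there.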


bits on the average, and therefore, also on the average

$$\delta \le k \lg k$$

so that $\delta$ can be encoded with $(1+\epsilon) \lg k$ bits (where $\epsilon$ is a small
positive constant less than $1/(e \cdot \ln 2)$).

So we have discovered that with the above convention, the
size of the pointers becomes independent of the size of the universe, so
that the root node for a set of n elements requires a pointer of size
$(2+\epsilon) \lg n$. (The term $\epsilon \lg n$ also absorbs any overhead due to the
variable length encoding scheme for the quantities n and $\delta$. See
[Even and Rodeh 78] for a possible encoding scheme.)
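One way to see where the constant $1/(e \cdot \ln 2)$ comes from: writing a displacement bounded by $k \lg k$ takes $\lg k + \lg\lg k$ bits, and the ratio $\lg x / x$ (with $x = \lg k$) peaks at $x = e$, where it equals $1/(e \ln 2)$. The check below is a sketch of that argument, not part of the original analysis.

```python
from math import e, log, log2

EPS_MAX = 1 / (e * log(2))            # about 0.53

def bits_for_displacement(k):
    """lg(k lg k): bits needed to write a displacement bounded by k lg k."""
    return log2(k * log2(k))

# The bound (1 + EPS_MAX) lg k holds for every k >= 2 tried here.
for k in (2, 4, 8, 16, 256, 2 ** 20):
    assert bits_for_displacement(k) <= (1 + EPS_MAX) * log2(k) + 1e-9
```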

So now we may compute

[expression (2.1.19) is missing in this copy]

as a good estimate for the total pointer space requirements, $P_5(s,n)$.
(Notice that we assign pointer space to every node, even though in practice
many nodes don't need it.) Again Appendix A provides us with the answer
(Section A.5) and we dutifully adopt it as

$$P_5(s,n) = (2+\epsilon) \cdot 2.96\, n \qquad (2.1.20)$$

(a remarkable result: about 6 bits per element are needed for pointers).
So allowing $\epsilon$ to increase $P_5(s,n)$ up to, say, 7n, we obtain

$$B_5(s,n) = \frac{3n}{\ln 2} + 7n + n\left(s - \lg n - \frac{\gamma}{\ln 2} - \frac{1}{2}\right) = n(s - \lg n + 10). \qquad (2.1.21)$$


Tries  include  chains  of  one-way  branching  nodes,  whose  only  function  is  to 
convey  path  information.  This  clearly  suggests  the  possibility  of  representing 
that  path  as  a bit  string  and  getting  rid  of  such  nodes,  and  that  turns  out 
to  be  our  representation  6 . 

In  this  representation,  each  node  includes  a (possibly  empty)  prefix 
field,  that  replaces  any  eliminated  chain  immediately  preceding  the  current 
node, as may be seen in Figure 2.1.5. Representation 6 is somewhat related
to the idea behind Patricia tries presented in [Morrison 68].

The tag size is

$$t_6(s,n) = 2 \qquad (2.1.22)$$

and the number of nodes

$$A_6(s,n) = n - 1. \qquad (2.1.23)$$

Pointer and set element space are (as in the previous case)

$$P_6(s,n) = P_5(s,n) = (2+\epsilon) \cdot 2.96\, n < 7n$$
$$T_6(s,n) = T_5(s,n) = n\left(s - \lg n - \frac{\gamma}{\ln 2} - \frac{1}{2}\right). \qquad (2.1.24)$$

Each one-way branching node in, say, representation 5 generates two
bits of prefix space, and there are $[A_5(s,n) - n]$ of those nodes. So
after including the stopper string "1", the space required to store
prefixes comes to

$$\mathrm{Prefix}(s,n) = \left(\frac{n}{\ln 2} - n\right) \times 2 + n = n\left(\frac{2}{\ln 2} - 1\right). \qquad (2.1.25)$$

Finally,

$$B_6(s,n) = 2(n-1) + 7n + n\left(s - \lg n - \frac{\gamma}{\ln 2} - \frac{1}{2}\right) + n\left(\frac{2}{\ln 2} - 1\right) = n(s - \lg n + 9.56). \qquad (2.1.26)$$

And thus we complete our rather prolonged tour through binary trie
representation of sets.

A closing remark: it is interesting to notice that it is possible
to encode the set and the trie structure and still come within (c·n)
bits of the lower space bound (2.0.3).
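To compare the representations at a glance, the sketch below evaluates the per-element estimates $B_i(s,n)/n$ for representations 0 through 6, using the closed forms of this section (ns; n(2.88s + 2.88); n(2.44s + 5.77); n(1.36s + 4.33); n(s − lg n + 10); n(s − lg n + 9.56)) and the bound (2.0.4). The function name and the sample parameters are illustrative assumptions.

```python
from math import log, log2

LN2 = log(2)

def bits_per_element(s, n):
    """Estimated B_i(s, n) / n for representations 0..6."""
    lg_n = log2(n)
    return {
        0: s - lg_n + 1 / LN2,               # mythical optimum (2.0.4)
        1: s,                                # sequential allocation
        2: 2 * (s + 1) / LN2,                # basic binary trie
        3: 4 / LN2 + (1 / LN2 + 1) * s,      # no null pointers
        4: 3 / LN2 + (0.25 / LN2 + 1) * s,   # fixed node order
        5: s - lg_n + 10,                    # variable length fields
        6: s - lg_n + 9.56,                  # one-way chains collapsed
    }

for rep, bits in bits_per_element(32, 2 ** 16).items():
    print(rep, round(bits, 2))
```

For s = 32 and n = 2^16 the last two representations need roughly 26 bits per element, about 8 bits above the optimum and below the 32 bits of plain sequential storage.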




2.2  Other  Representations  Based  on  Tries. 


Several  variations  on  the  basic  binary  trie  may  be  used,  and  this 
section  will  simply  state  some  of  them  without  further  analysis. 

First,  there  is  an  obvious  generalization  of  binary  to  M-ary  tries. 

An  M-ary  trie  representation  reduces  the  average  number  of  nodes  to 
approximately n/(M ln 2), but it forces an increase in the size of
each  node.  Further,  storing  the  nodes  in  a fixed  order  (as  in  the  last 
three  representations  of  Section  2.1)  will  only  save  one  out  of  M 
pointers.  A way  to  produce  more  economical  M-ary  tries  is  to  store  only 
the  non-null  pointers,  though  a price  is  paid  in  the  increase  in  time 
needed  to  process  each  node  for  set  operations. 

An  interesting  way  to  represent  M-ary  tries  (compressed  tries  or 
C-tries) is presented by Maly [Maly 76]. A C-trie stores only a pointer
to  the  first  sibling  of  each  node,  and  stores  all  the  siblings  sequentially. 

As  an  important  consequence  of  using  M-ary  tries,  the  parameter  M 
may  be  chosen  in  order  to  minimize  the  "waste"  if  the  data  structure  is 
to  be  stored  within  byte  or  word  boundaries. 

A second  idea  follows  the  old  paradigm  "keep  small  things  simple...": 
there  is  no  better  way  to  store  a small  set  than  a simple  one  (e.g.,  a 
sequentially  stored  list).  More  precisely,  store  small  subsets  (those  of 
size  less  than  a certain  parameter  q ) sequentially.  This  scheme  will 
certainly  reduce  the  overall  complexity  of  the  data  structure,  though  it 
complicates  the  implementation  of  some  set  operations. 

A combination  of  the  M-ary  trie  and  the  small  subset  sequential  storing 
ideas is analyzed by Knuth [Knuth 73, Exercise 6.3.20]. That analysis shows
that the average number of nodes is reduced to n/(M·q·ln 2). Further
savings  are  obtained  by  storing  only  the  suffixes  of  the  elements  of  each 
Figure 2.2.1. A "combo" representation, combining an M-ary
trie and the representation of sets of size
less than q as sequential lists of the suffixes
of each element.


3.  The Trie Intersection Algorithm.

Once two or more subsets of a given universe U have been represented
as tries in a consistent manner, it is possible to take advantage of such
representations for the purpose of computing their intersection (or their
union, though the union algorithm will be discussed later).

Let us assume two subsets $P_1$ and $P_2$ of a universe U; both $P_1$
and $P_2$ are represented as tries according to the partition $U = U' \cup U''$
(thus $P_1$ has subtries $P_1' = P_1 \cap U'$ and $P_1'' = P_1 \cap U''$ and, similarly,
$P_2' = P_2 \cap U'$ and $P_2'' = P_2 \cap U''$). Now the intersection $P_1 \cap P_2$ (within
the universe U) can be computed as follows:

$$\mathrm{Intersection}(P_1, P_2, U) = \mathrm{Intersection}(P_1', P_2', U') \cup \mathrm{Intersection}(P_1'', P_2'', U'') \qquad (3.0.1)$$

but since both intersections on the right hand side are disjoint, the union
is simply a juxtaposition, requiring constant time.

The  above  recursion  should  proceed  until  the  computation  becomes  trivial, 
either  because  both  sets  are  small  enough  or  because  one  of  them  is  empty. 

And  that  is  where  the  savings  are;  when  either  one  of  the  sets  is  empty, 
the  other  one  can  be  disregarded  immediately.  On  the  average  the  above 
situation  arises  very  often,  especially  if  the  starting  sets  are  of 
different  sizes. 
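The recursion can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the trie is modelled as `None` (empty), an `int` (a single element) or a `(left, right)` pair, the universe is the half-open range `[lo, hi)` with a power-of-two width, and all function names are assumptions. Python list concatenation is not constant time, so the "juxtaposition" cost is only approximated here.

```python
def build(elems, lo, hi):
    """Binary trie of distinct integers drawn from [lo, hi)."""
    if not elems:
        return None
    if len(elems) == 1:
        return elems[0]
    mid = (lo + hi) // 2
    return (build([x for x in elems if x < mid], lo, mid),
            build([x for x in elems if x >= mid], mid, hi))

def contains(trie, x, lo, hi):
    """Follow the path of x; behaves like a binary search over the universe."""
    while isinstance(trie, tuple):
        mid = (lo + hi) // 2
        if x < mid:
            trie, hi = trie[0], mid
        else:
            trie, lo = trie[1], mid
    return trie == x                  # None never equals an element

def intersect(a, b, lo, hi):
    """Parallel traversal of two tries; an empty side prunes the other."""
    if a is None or b is None:
        return []
    if not isinstance(a, tuple):
        return [a] if contains(b, a, lo, hi) else []
    if not isinstance(b, tuple):
        return [b] if contains(a, b, lo, hi) else []
    mid = (lo + hi) // 2
    return intersect(a[0], b[0], lo, mid) + intersect(a[1], b[1], mid, hi)

big = build([3, 25, 50, 51, 130, 189, 200, 232, 249], 0, 256)
small = build([25, 51, 77, 200], 0, 256)
print(intersect(big, small, 0, 256))  # [25, 51, 200]
```

When one set is much smaller than the other, most calls end in `contains`, which is the "series of binary searches" behaviour described in the abstract.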

In  order  to  analyze  the  method,  a measure  has  to  be  introduced,  and 
the number of trie nodes visited seems plausible since the amount of
work at each node is bounded by a constant.

So we will now pick up the discussion about intersection time started
in the introduction. Again, assume the sets $P_1, P_2, \ldots, P_m$ are represented
as binary tries within a universe U, with $\bar P = \langle |P_1|, |P_2|, \ldots, |P_m| \rangle$ the
sequence of sizes and Q, |Q| = q, their "arbitrary" intersection.

We will estimate the average intersection time (given in Equation (1.4))
as

$$I(\bar P, q) \le V(\bar P, q) + t(\bar P) \qquad (3.0.2)$$

where $V(\bar P, q)$ is as before and $t(\bar P)$ is a bound on the wasted effort
$t(\bar P, q)$.

In order to bound $t(\bar P, q)$, notice that since

$$|W_i| = |P_i - Q| \le |P_i|, \qquad (3.0.3)$$

verifying that the intersection of the random subsets $W_1, W_2, \ldots, W_m$
is empty must take less time than computing the overall intersection of totally
random sets of sizes $|P_i|$ (that is, sets whose intersection size is not
arbitrarily preset):

$$t(\bar P, q) \le I(\bar P, 0) = t(\bar P). \qquad (3.0.4)$$

Thus, Section 3.1 analyzes $t(\bar P)$ as the running time of the algorithm
over random sets in a universe $U_s = [0, 2^s)$, and produces a hopelessly
inscrutable answer, so Section 3.2 analyzes the limiting case of the
universe of reals U = [0,1) that allows for an asymptotic expansion
for $t(\bar P)$ (presented in Section 3.3). Finally, Section 3.4 analyzes the
overall intersection time and discusses some examples.

A Bit  of  Notation. 

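Most of the equations in this chapter will deal with m-tuples, and
a concise notation is needed. An m-tuple will be denoted by a letter with
an overbar; let $\bar A = \langle a_1, a_2, \ldots, a_m \rangle$ and $\bar B = \langle b_1, b_2, \ldots, b_m \rangle$. A scalar v
will expand to the m-tuple $\bar v = \langle v, v, \ldots, v \rangle$.

Operators will be extended to apply to m-tuples:

$$\bar A * \bar B = \langle a_1 * b_1, \ldots, a_m * b_m \rangle \quad \text{for } * \text{ denoting basic arithmetic operations}$$
$$\delta_{\bar A \bar B} = \langle \delta_{a_1 b_1}, \ldots, \delta_{a_m b_m} \rangle$$
$$\bar A^{\bar B} = \langle a_1^{b_1}, \ldots, a_m^{b_m} \rangle$$
$$\bar A! = \langle a_1!, \ldots, a_m! \rangle$$
$$\binom{\bar A}{\bar B} = \left\langle \binom{a_1}{b_1}, \ldots, \binom{a_m}{b_m} \right\rangle$$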


Operations  on  scalars  and  m-tuples  will  denote  the  coercion  of  the 
scalar  to  the  corresponding  m-tuple  followed  by  the  operation  itself. 
For  instance; 


Finally some operators will map m-tuples onto scalars:

$$\Sigma \bar A = a_1 + \cdots + a_m \qquad \Pi \bar A = a_1 \cdot \ldots \cdot a_m$$

$$\bar A \le \bar B = (a_1 \le b_1) \wedge \cdots \wedge (a_m \le b_m)$$

$$\bar A < \bar B = (\bar A \le \bar B) \wedge ((a_1 < b_1) \vee \cdots \vee (a_m < b_m))$$
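The m-tuple conventions are easy to mirror in code. The sketch below is only an illustration with hypothetical helper names (`lift`, `apply`, `leq`): it shows scalar coercion, componentwise operators, and the reductions to scalars.

```python
from math import comb, prod

def lift(x, m):
    """Coerce a scalar to the m-tuple <x, ..., x>; leave tuples unchanged."""
    return x if isinstance(x, tuple) else (x,) * m

def apply(op, a, b):
    """Componentwise binary operator; at least one argument must be a tuple."""
    m = len(a) if isinstance(a, tuple) else len(b)
    return tuple(op(x, y) for x, y in zip(lift(a, m), lift(b, m)))

def leq(a, b):
    """A <= B holds when a_i <= b_i for every component."""
    return all(x <= y for x, y in zip(a, b))

P, R = (3, 5, 2), (1, 5, 0)
diff = apply(lambda x, y: x - y, P, R)        # componentwise P - R
binoms = apply(comb, P, R)                    # componentwise binomials
shifted = apply(lambda x, y: x + y, 1, P)     # scalar 1 coerced to <1, 1, 1>
print(sum(P), prod(binoms), leq(R, P))        # 10 3 True
```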


3.1  Wasting Effort in the Finite Universe.

Assume that the sets $P_1, P_2, \ldots, P_m$ are random subsets of the universe
$U_s = [0, 2^s)$, and have sizes $\bar P = \langle p_1, p_2, \ldots, p_m \rangle$ respectively. We are
interested in the running time $t_s(\bar P)$ of the intersection algorithm
for random sets $P_1, P_2, \ldots, P_m$.

Given that the binary trie representation partitions the universe
into $U' = [0, 2^{s-1})$ and $U'' = [2^{s-1}, 2^s)$, we have that for any sequence
of sets $P_1, \ldots, P_m$, such that $|P_i| = p_i$, the algorithm will
take the lower half partitions $P_1' = P_1 \cap U', \ldots, P_m' = P_m \cap U'$ and
intersect them, and then do the same for the upper half partitions
$P_i'' = P_i \cap U''$.

Hence, assuming non-empty sets ($p_i \ge 1$ for every i, and $\bar P \le \overline{2^s}$)
and $s \ge 1$,

$$t_s(\bar P) = \sum_{\bar 0 \le \bar R \le \bar P} \mathrm{Part}(\bar P, \bar R, s) \cdot [t_{s-1}(\bar R) + t_{s-1}(\bar P - \bar R)] + v \qquad (3.1.1)$$

where

    $\mathrm{Part}(\bar P, \bar R, s)$:  probability of a partition $\bar R$ (lower half) and $\bar P - \bar R$
                  (upper half) of a sequence of sets of size $\bar P$;
    v:            time required to visit the m roots of the tries
                  representing the sets.

Equation (3.1.1) does not cover the cases where at least one of the
sets to be intersected is empty, or the boundary case when the universe
is of size 1. So

    for $\bar P$ such that $\exists i,\ p_i = 0$:  let $t_s(\bar P) = 0$
                                   (the intersection is known to be empty)
    for s = 0:                     $t_0(\bar P) = 0$.                          (3.1.2)

Now the probability of a set $P_i$ (of size $p_i$) splitting into lower
and upper subsets $P_i'$ and $P_i''$ (of sizes $p_i'$ and $p_i'' = p_i - p_i'$) is

$$\binom{2^{s-1}}{p_i'} \binom{2^{s-1}}{p_i''} \Big/ \binom{2^s}{p_i},$$

so that

$$\mathrm{Part}(\bar P, \bar R, s) = \Pi \left[ \binom{2^{s-1}}{\bar R} \binom{2^{s-1}}{\bar P - \bar R} \Big/ \binom{2^s}{\bar P} \right] \qquad (3.1.3)$$

and obviously $\mathrm{Part}(\bar P, \bar R, s) = \mathrm{Part}(\bar P, \bar P - \bar R, s)$, so Equation (3.1.1) becomes

    for s > 0:
$$t_s(\bar P) = 2 \sum_{\bar 1 \le \bar R \le \bar P} \Pi \left[ \binom{2^{s-1}}{\bar R} \binom{2^{s-1}}{\bar P - \bar R} \Big/ \binom{2^s}{\bar P} \right] t_{s-1}(\bar R) + v \cdot \Pi(1 - \delta_{\bar P, \bar 0})$$
    and
$$t_0(\bar P) = 0. \qquad (3.1.4)$$

(Notice that the lower bound 1 in the sum is a consequence of Equation (3.1.2).)
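For small parameters the recurrence can simply be evaluated. The sketch below implements Equations (3.1.1) and (3.1.2) with exact rational arithmetic, taking the partition probability to be the product of per-set hypergeometric terms (which follows from all subsets of a given size being equally probable). The names `part`, `t` and the unit cost `V = 1` are assumptions made for illustration, and every p_i must satisfy p_i ≤ 2^s.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product
from math import comb

V = 1  # assumed cost v of visiting the m roots

def part(P, R, s):
    """Probability that sets of sizes P put R elements in the lower half."""
    half = 2 ** (s - 1)
    num, den = 1, 1
    for p, r in zip(P, R):
        num *= comb(half, r) * comb(half, p - r)
        den *= comb(2 ** s, p)
    return Fraction(num, den)

@lru_cache(maxsize=None)
def t(s, P):
    """Average nodes-visited cost t_s(P) for random sets of sizes P."""
    if s == 0 or min(P) == 0:
        return Fraction(0)                       # boundary cases (3.1.2)
    total = Fraction(V)
    for R in product(*(range(p + 1) for p in P)):
        w = part(P, R, s)
        if w:
            rest = tuple(p - r for p, r in zip(P, R))
            total += w * (t(s - 1, R) + t(s - 1, rest))
    return total

print(t(1, (1, 1)), t(2, (1, 1)))   # 1 3/2
```

The number of partitions R grows as the product of the set sizes, so this direct evaluation is only practical for tiny cases.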


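A further simplification can be obtained by defining

$$T_s(\bar P) = \Pi \binom{2^s}{\bar P} \cdot t_s(\bar P) \qquad (3.1.5)$$

yielding

    for s > 0:
$$T_s(\bar P) = 2 \sum_{\bar 1 \le \bar R \le \bar P} \Pi \binom{2^{s-1}}{\bar P - \bar R}\, T_{s-1}(\bar R) + v \cdot \Pi \binom{2^s}{\bar P}\, \Pi(1 - \delta_{\bar P, \bar 0})$$
    and
$$T_0(\bar P) = 0. \qquad (3.1.6)$$

Now Equation (3.1.6) can be solved by means of the generating function

$$T_s = T_s(\bar z) = \sum_{\bar P} T_s(\bar P)\, \Pi(\bar z^{\bar P}) \qquad (3.1.7)$$

where $\bar z$ denotes the "m-variable" $\langle z_1, z_2, \ldots, z_m \rangle$, and thus

$$\Pi(\bar z^{\bar P}) = z_1^{p_1} \cdot z_2^{p_2} \cdots z_m^{p_m}.$$

So

$$T_s = \sum_{\bar 1 \le \bar P} 2 \sum_{\bar 1 \le \bar R \le \bar P} \Pi \binom{2^{s-1}}{\bar P - \bar R}\, T_{s-1}(\bar R)\, \Pi(\bar z^{\bar P}) + v \cdot \sum_{\bar 1 \le \bar P} \Pi \binom{2^s}{\bar P}\, \Pi(\bar z^{\bar P}). \qquad (3.1.8)$$

Defining $A_s = \sum_{\bar P} \Pi \binom{2^s}{\bar P}\, \Pi(\bar z^{\bar P}) = [\Pi(1 + \bar z)]^{2^s} = Y^{2^s}$, where
$Y = \Pi(1 + \bar z)$, and noticing the convolution in the first sum of (3.1.8):

$$T_s = 2 A_{s-1} T_{s-1} + v \cdot B_s \qquad (3.1.9)$$

where

$$B_s = \sum_{\bar 1 \le \bar P} \Pi \binom{2^s}{\bar P}\, \Pi(\bar z^{\bar P}).$$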
It is now a matter of iterating Equation (3.1.9),

$$T_s = 2 A_{s-1} (2 A_{s-2} T_{s-2} + v B_{s-1}) + v B_s$$
$$\quad = 2^2 A_{s-1} A_{s-2} T_{s-2} + 2 v A_{s-1} B_{s-1} + v B_s$$
$$\quad \vdots$$
$$\quad = 2^k A_{s-1} A_{s-2} \cdots A_{s-k} T_{s-k} + v \sum_{0 \le j < k} 2^j A_{s-1} \cdots A_{s-j} B_{s-j}$$
$$\quad = 2^{k+1} A_{s-1} A_{s-2} \cdots A_{s-k} A_{s-k-1} T_{s-k-1} + 2^k v A_{s-1} \cdots A_{s-k} B_{s-k} + v \sum_{0 \le j < k} 2^j A_{s-1} \cdots A_{s-j} B_{s-j}.$$

And now, recalling that $T_0 = 0$,

$$T_s = v \cdot \sum_{0 \le j < s} 2^j A_{s-1} \cdots A_{s-j} B_{s-j}. \qquad (3.1.10)$$

Now

$$A_{s-1} \cdots A_{s-j} = Y^{2^{s-1} + \cdots + 2^{s-j}} = Y^{2^s - 2^{s-j}} = \Pi (1 + \bar z)^{2^s - 2^{s-j}}$$

and

$$B_{s-j} = \Pi \sum_{1 \le p} \binom{2^{s-j}}{p} z^p = \Pi \left[ (1 + \bar z)^{2^{s-j}} - 1 \right]$$

giving

[Equation (3.1.12) is missing in this copy.]


Equation (3.1.12) is a solution to the original recurrence, but not a
very useful one. Not only does it not yield a more reasonable closed form
or an asymptotic expansion, but it is very difficult to compute, due to the
explosive combinatorial terms it contains!

Hence, something has to be done.

3.2 The Infinite Universe Model.

We  will  now  analyze  the  behavior  of  the  intersection  for  the  limiting 
case  universe  U = [0,1)  . 

Let $t(P)$ be the average running time of the intersection algorithm
for random subsets $P_1, P_2, \ldots, P_m$, of sizes $P = (p_1, p_2, \ldots, p_m)$ respectively.

No simple argument could be found to prove formally the rather
intuitive fact

$$t_s(P) < t(P).$$

(The fact that the average number of nodes of a trie representing a random
set grows with $s$ (see Section A.2 for a detailed argument) strongly
supports this intuition.)

In any case it is easy to see that the probability $\mathrm{Part}(P,R,s)$
(as defined in Equation (3.1.1)) of a partition tends to the limit case

$$\lim_{s \to \infty} \mathrm{Part}(P,R,s) = \mathrm{Part}(P,R)$$

where $\mathrm{Part}(P,R)$ is the probability of a partition in the infinite
universe $U$ (refer to Equations (A.2.9) and (A.2.10)).

Consequently

$$\lim_{s \to \infty} t_s(P) = t(P)$$

and we shall adopt $t(P)$ as an estimate of $t_s(P)$ without further
discussion.

In order to compute $t(P)$ we observe that the intersection will
proceed by considering both halves of $U$: $U' = [0,1/2)$ and $U'' = [1/2,1)$.

But both $U'$ and $U''$ are similar to $U$ in the sense that the partition
probabilities have the same distribution in the three cases.

Hence, by looking at all possible partitions, we arrive at the following
recurrence:

$$t(P) = \sum_{0 \le R \le P} \mathrm{Part}(P,R) \, [t(R) + t(P-R)] + v \tag{3.2.1}$$

where $v$ is the time required to visit the $m$ roots of the tries.
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The  boundary  condition  is  simply 


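$$\text{for } P:\ \exists i,\ p_i = 0: \quad t(P) = 0.$$

And the probability of a random set $P_i \subset U$, $|P_i| = p_i$, to be of the form $|P_i \cap U'| = r_i$ is

$$\binom{p_i}{r_i} 2^{-p_i} \tag{3.2.2}$$

and thus

$$\mathrm{Part}(P,R) = \prod_{1 \le i \le m} \binom{p_i}{r_i} 2^{-p_i}. \tag{3.2.3}$$

So Equation (3.2.1) becomes

$$t(P) = 2 \sum_{1 \le R \le P} 2^{-\Sigma P} \prod_{1 \le i \le m} \binom{p_i}{r_i} \cdot t(R) + v \cdot \prod (1 - \delta_{P,0}). \tag{3.2.4}$$

An exponential generating function will take care of Equation (3.2.4):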


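$$T = T(z) = \sum_P t(P) \prod_i \frac{z_i^{p_i}}{p_i!} = 2 \sum_P \Big[ \sum_{1 \le R \le P} \prod_i \frac{z_i^{p_i}}{2^{p_i} \, p_i!} \binom{p_i}{r_i} \cdot t(R) \Big] + v \sum_{1 \le P} \prod_i \frac{z_i^{p_i}}{p_i!} \tag{3.2.5}$$

The first term of the previous equation results from the convolution of

$$e^{\frac{1}{2} \Sigma z} = \sum_P \prod_i \frac{z_i^{p_i}}{2^{p_i} \cdot p_i!}$$

and $T(\tfrac{1}{2} z)$.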
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The  sum  in  (3.2.9)  is 


some  asymptotic  results. 


3.3  Asymptotic  Behavior. 


The function

$$U_n = \sum_{k \ge 2} \binom{n}{k} \frac{(-1)^k}{2^{k-1} - 1} \tag{3.3.1}$$

that appears very often in the study of binary tries will be the basis
for the asymptotic expansion of Equation (3.2.11), since it is possible
to express $t(P)$ in terms of $U_n$.

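First we need to get rid of the products of combinatorial numbers
in Equation (3.2.11) by noticing that for any $P$ and $x$

$$\begin{aligned}
\sum_j x^j \sum_{\substack{1 \le R \\ \Sigma R = j}} \prod_i \binom{p_i}{r_i}
&= [(1+x)^{p_1} - 1] \cdots [(1+x)^{p_m} - 1] \\
&= \sum_{0 \le \epsilon \le 1} (1+x)^{\Sigma \epsilon P} (-1)^{m + \Sigma \epsilon} \\
&= \sum_{0 \le \epsilon \le 1} (-1)^{m + \Sigma \epsilon} \sum_j \binom{\Sigma \epsilon P}{j} x^j \\
&= \sum_j x^j \sum_{0 \le \epsilon \le 1} \binom{\Sigma \epsilon P}{j} (-1)^{m + \Sigma \epsilon}.
\end{aligned} \tag{3.3.2}$$

[Notice that the index $\epsilon$ in $0 \le \epsilon \le 1$ ranges through all the m-tuples
of 0's and 1's, so that $\Sigma \epsilon P = \sum_i \epsilon_i p_i$.]

Looking at the first and last expressions in (3.3.2) it is easy to
see that

$$\sum_{\substack{1 \le R \\ \Sigma R = j}} \prod_i \binom{p_i}{r_i} = \sum_{0 \le \epsilon \le 1} \binom{\Sigma \epsilon P}{j} (-1)^{m + \Sigma \epsilon}.$$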


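Now we can massage (3.2.11) into better shape:

$$\begin{aligned}
t(P) &= v \cdot \sum_{m \le r} \frac{(-1)^{r+m}}{1 - 2^{-r+1}} \sum_{\Sigma R = r} \prod_i \binom{p_i}{r_i} \\
&= v \cdot (-1)^m \sum_{m \le r} (-1)^r \sum_{\Sigma R = r} \prod_i \binom{p_i}{r_i} + v \cdot \sum_{m \le r} \frac{(-1)^{r+m}}{2^{r-1} - 1} \sum_{\Sigma R = r} \prod_i \binom{p_i}{r_i}.
\end{aligned} \tag{3.3.4}$$

The first sum is Equation (3.3.2) with $x = -1$ and an extra factor
of $v \cdot (-1)^m$, hence its value is $v$.

The second sum's index, $r$, ranges from $m$, but since the inner
sum is zero for $2 \le r < m$,

$$\begin{aligned}
t(P) &= v \Big[ 1 + \sum_{2 \le r} \frac{(-1)^{r+m}}{2^{r-1} - 1} \sum_{\Sigma R = r} \prod_i \binom{p_i}{r_i} \Big] \\
&= v \Big[ 1 + \sum_{2 \le r} \frac{(-1)^{r+m}}{2^{r-1} - 1} \sum_{0 \le \epsilon \le 1} \binom{\Sigma \epsilon P}{r} (-1)^{m + \Sigma \epsilon} \Big] \\
&= v \Big[ 1 + \sum_{0 \le \epsilon \le 1} (-1)^{\Sigma \epsilon} \sum_{2 \le r} \frac{(-1)^r}{2^{r-1} - 1} \binom{\Sigma \epsilon P}{r} \Big].
\end{aligned} \tag{3.3.5}$$

The inner sum over $r$ in the last line has the form of $U_n$ with $n = \Sigma \epsilon P$.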
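As a sanity check on the derivation above, the following sketch solves the infinite-universe recurrence exactly for small size vectors and compares it with the closed form in terms of $U_n$. It rests on assumptions: the recurrence is taken in the binomial form $t(P) = 2 \cdot 2^{-\Sigma P} \sum_{1 \le R \le P} \prod \binom{p_i}{r_i} t(R) + v$, $U_n$ is taken to be $\sum_{k \ge 2} \binom{n}{k} (-1)^k / (2^{k-1} - 1)$, $m \ge 2$, and the names `U`, `t_closed` and `t_rec` are illustrative only.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product
from math import comb


def U(n):
    """U_n = sum_{k>=2} C(n,k) (-1)^k / (2^(k-1) - 1), as assumed in the lead-in."""
    return sum(Fraction((-1) ** k * comb(n, k), 2 ** (k - 1) - 1)
               for k in range(2, n + 1))


def t_closed(P, v=1):
    """v * [1 + sum over all 0/1 tuples eps of (-1)^(sum eps) * U_(eps . P)]."""
    total = Fraction(1)
    for eps in product((0, 1), repeat=len(P)):
        n = sum(e * p for e, p in zip(eps, P))
        total += (-1) ** sum(eps) * U(n)
    return v * total


@lru_cache(maxsize=None)
def t_rec(P, v=1):
    """Solve the recurrence exactly for a tuple of sizes P (requires m >= 2)."""
    if min(P) == 0:
        return Fraction(0)          # boundary condition: some set is empty
    c = Fraction(2, 2 ** sum(P))    # the factor 2 * 2^(-sum P)
    acc = Fraction(0)
    for R in product(*(range(1, p + 1) for p in P)):
        if R != P:
            weight = 1
            for p, r in zip(P, R):
                weight *= comb(p, r)
            acc += weight * t_rec(R, v)
    # The term R = P appears on both sides:
    # t(P) = c * (acc + t(P)) + v  =>  t(P) = (c * acc + v) / (1 - c)
    return (c * acc + v) / (1 - c)
```

Exact rational arithmetic is used so that the two expressions can be compared for equality rather than within a tolerance.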


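Now Equation (3.3.5) can be expanded using

$$U_n = n[\lg n + f(n)] + cn + O(1) \tag{3.3.6}$$

where $f(n)$ is a strange periodic function of "average value" zero
(see Section A.3 for further details) such that

$$|f(n)| < f_0 < 2 \times 10^{-7}.$$

So, discarding the lower bound $\epsilon = 0$ in (3.3.5),

$$t(P) = v \cdot \sum_{0 < \epsilon \le 1} (-1)^{\Sigma \epsilon} \, \Sigma \epsilon P \, [\lg \Sigma \epsilon P + f(\Sigma \epsilon P)] + O(1) \tag{3.3.7}$$

since $c \sum_{0 \le \epsilon \le 1} (-1)^{\Sigma \epsilon} \, \Sigma \epsilon P$ cancels itself out.

Equation (3.3.7) provides us with the means to visualize the behavior
of trie intersection. To account for the $f(\Sigma \epsilon P)$ terms, we first notice
that

$$\sum_{0 \le \epsilon \le 1} (-1)^{\Sigma \epsilon} \, \Sigma \epsilon P \, f(\Sigma \epsilon P) < f_0 \cdot \sum_{0 \le \epsilon \le 1} \Sigma \epsilon P = f_0 \cdot 2^{m-1} \, \Sigma P \tag{3.3.8}$$

which shows that, even without considering cancellations, we can safely
ignore the contribution of the $f(\Sigma \epsilon P)$ terms when intersecting sets of
size $O(1/f_0)$, and thus

$$t(P) = v \cdot \sum_{0 < \epsilon \le 1} (-1)^{\Sigma \epsilon} \, \Sigma \epsilon P \lg \Sigma \epsilon P + O(\text{small}), \tag{3.3.9}$$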

where $O(\text{small})$ stands for the periodic terms that we will henceforth
discard.

3.4 Overall Bounds on Trie Intersection.

Equation (3.3.9) has finally provided us with a more manageable
bound on the "wasted effort" for the trie intersection algorithm. But
we still have to estimate the time $V(P,q)$ spent verifying that an
arbitrary intersection $Q = P_1 \cap P_2 \cap \cdots \cap P_m$ is a proper subset of
each set.

Consider a set $P$ and a proper subset $Q \subset P$; Figure 3.4.1 shows
the binary trie representation for $P$; the shaded nodes would also
appear in the trie for $Q$.

In order to verify that $Q \subset P$ we need to

• traverse all the shaded nodes (the trie for $Q$);

• every time that the boundary is reached, follow a path to
the element in $P$ that belongs in $Q$.

Thus we can estimate that the average time $v(p,q)$ for a random
set $P$, $|P| = p$, and subset $Q \subset P$, $|Q| = q$, is

$$v(p,q) = \frac{q}{\ln 2} + q(\lg p - \lg q) = q \Big( \frac{1}{\ln 2} + \lg \frac{p}{q} \Big) \tag{3.4.1}$$

yielding

$$V(P,q) = \sum_i v(p_i,q) = m \cdot \frac{q}{\ln 2} + q \sum_i \lg(p_i/q). \tag{3.4.2}$$

Combining Equations (3.3.9) and (3.4.2) results in

$$I(P,q) \le v \cdot \sum_{0 < \epsilon \le 1} (-1)^{\Sigma \epsilon} \, \Sigma \epsilon P \lg \Sigma \epsilon P + q \Big[ \frac{m}{\ln 2} + \sum_i \lg(p_i/q) \Big] \tag{3.4.3}$$

our final bound for the trie intersection algorithm.

Lastly,  let  us  discuss  a couple  of  particular  cases. 


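Intersection of two sets: Equation (3.4.3) becomes

$$I(p_1,p_2,q) \le v \, p_1 \lg \Big( 1 + \frac{p_2}{p_1} \Big) + v \, p_2 \lg \Big( 1 + \frac{p_1}{p_2} \Big) + q \Big[ \frac{2}{\ln 2} + \lg \frac{p_1}{q} + \lg \frac{p_2}{q} \Big].$$

It is interesting to analyze the "wasted effort" $I(p_1,p_2,0)$
separately:

for $p_1 \ll p_2$: $I(p_1,p_2,0) = O(p_1 \lg p_2)$
for $p_1 \approx p_2$: $I(p_1,p_2,0) = O(p_1 + p_2)$
for $p_1 \gg p_2$: $I(p_1,p_2,0) = O(p_2 \lg p_1)$          (3.4.4)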


In  other  words,  when  the  sets  are  different  in  size,  trie  intersection 
behaves  like  a series  of  binary  searches  and  as  they  become  comparable  in 
size,  the  intersection  time  becomes  the  usual  ordered  list  intersection  time. 
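The three regimes can be explored numerically. The sketch below evaluates the "wasted effort" term of (3.4.3) by summing over all non-zero 0/1 tuples, and compares it with the two-set closed form. The function names and the default $v = 1$ are assumptions made for illustration, and the periodic terms are ignored as in (3.3.9).

```python
from itertools import product
from math import log2


def wasted_effort(sizes, v=1.0):
    """v * sum over non-zero 0/1 tuples eps of (-1)^(sum eps) * n * lg n, n = eps . sizes."""
    total = 0.0
    for eps in product((0, 1), repeat=len(sizes)):
        n = sum(e * p for e, p in zip(eps, sizes))
        if n > 0:
            total += (-1) ** sum(eps) * n * log2(n)
    return v * total


def two_set_form(p1, p2, v=1.0):
    """Closed form for m = 2: v*p1*lg(1 + p2/p1) + v*p2*lg(1 + p1/p2)."""
    return v * p1 * log2(1 + p2 / p1) + v * p2 * log2(1 + p1 / p2)
```

For example, `two_set_form(p, p)` reduces to `2 * p`, the linear behavior of ordered list intersection, while for `p1` much smaller than `p2` the first term, roughly `p1 * lg(p2 / p1)`, dominates.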




Intersection of sets of equal size: Assume $p_i = p$ for all $i$, and that the
time $v$ to visit a given node is simply $m$, then

$$I(p,q) \le p \cdot m \cdot L_m + q \cdot m \cdot \Big[ \frac{1}{\ln 2} + \lg \frac{p}{q} \Big] \tag{3.4.5}$$

where

$$L_m = \sum_{2 \le j \le m} (-1)^j \binom{m}{j} \, j \lg j.$$

The factor $L_m$ does not seem to decrease fast enough (see Table 3.4.2
below) to make the parallel computation of the intersection competitive
with a sequence of intersections (intersect $P_1 \cap P_2$, then $(P_1 \cap P_2)$
with $P_3$, and so on) that would very quickly "weed out" the elements
that do not belong in $Q$. But three-way intersection may be better than
two.


   m      L_m     m · L_m

   2      2.0      4.00
   3      1.25     3.74
   4      0.98     3.92
   5      0.84     4.21
  10      0.58     5.85
  20      0.45     8.98

Table 3.4.2. Tabulation of $L_m$.
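The entries of the table can be recomputed directly. The sketch below assumes that $L_m$ is the sum $\sum_{2 \le j \le m} (-1)^j \binom{m}{j} j \lg j$, which is what (3.3.9) gives when all sizes are equal; the function name `L` is illustrative.

```python
from math import comb, log2


def L(m):
    """L_m = sum_{j=2..m} (-1)^j * C(m, j) * j * lg j (assumed form, see lead-in)."""
    return sum((-1) ** j * comb(m, j) * j * log2(j) for j in range(2, m + 1))


if __name__ == "__main__":
    # Print the three columns of the table: m, L_m and m * L_m.
    for m in (2, 3, 4, 5, 10, 20):
        print(f"{m:3d}  {L(m):5.2f}  {m * L(m):5.2f}")
```

Since the sum alternates in sign and its terms grow quickly with $m$, floating-point evaluation is only appropriate for moderate $m$ such as those tabulated.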

4.  Hashed  Representation  and  Intersection  of  Sets. 


Searching  for  an  element  in  a given  set  can  be  a very  fast  operation; 
this  suggests  the  following  intersection  procedure: 

Let $P$ and $R$ be the sets to be intersected. Choose the smaller
one (say $P$), and search for each one of its elements in $R$.

The average time to intersect the above two sets is

$$t(P,R) = C(R) \cdot |P| \tag{4.0.1}$$

where $C(R)$ is the average time to search for an element in $R$. And
every practitioner of the arcane mysteries of computing knows that hashing
methods require (on the average) only a constant time per search, so
the intersection of two sets $P$ and $R$ of sizes $p$ and $r$ will
require

$$t(p,r) = c \cdot \min(p,r). \tag{4.0.2}$$

So  hashed  intersection,  as  we  shall  call  the  intersection  method  loosely 
described  above,  seems  difficult  to  beat;  its  running  time  is  proportional 
to  the  size  of  the  smallest  set  and  the  constant  c can  be  made  duly 
small  by  choosing  a good  enough  hashed  representation. 
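A minimal sketch of this procedure follows. Python's built-in `set` stands in for a prebuilt hashed representation of the larger set (in the static setting of this work that table would already exist and would not be rebuilt per query); the function name and the probe counter are illustrative assumptions.

```python
def hashed_intersection(p, r):
    """Intersect two collections by probing the larger one with each element of the smaller."""
    small, large = (p, r) if len(p) <= len(r) else (r, p)
    table = set(large)   # stand-in for the stored hashed representation of the larger set
    probes = 0
    result = []
    for x in small:
        probes += 1      # one retrieval per element of the smaller set
        if x in table:
            result.append(x)
    return result, probes
```

The number of probes equals $\min(p, r)$ regardless of the size of the larger set, which is the content of (4.0.2).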

But not everything is rosy: accessing a hash table is usually done
in a random fashion, thus requiring a large working set for the algorithm;
further, the worst case running time can be very bad (as bad as $O(p \cdot r)$).

The first part of this chapter deals with different hashed representations.
The second part refers to the hashed intersection algorithm, especially to
the problems that we have just stated, presenting solutions to them and
discussing some of the trade-offs involved.

4.1 Storing the Sets.


This  section  discusses  different  ways  of  representing  unchanging  sets 
utilizing  hashing  techniques. 

In order to represent a set $S \subseteq U$, we need

• a hash array $T = \langle t_0, \ldots, t_{M-1} \rangle$;

• a hash function $h: U \to [0,M)$;

and • an unhash function $h^{-1}: [0,M) \times T \to U$.

Now the elements of $S$ are hashed into $T$ by means of the hash function
$h$ and a suitable collision resolution strategy. Enough information about
each element is stored in $T$ so that further retrieval is possible by means
of the unhash function $h^{-1}$.

We will further assume that all the hash functions chosen are
probabilistically "fair", in the sense that for all $x, y \in U$ and for all
$a \in [0,M)$: $\mathrm{Prob}[h(x) = a] = \mathrm{Prob}[h(y) = a] = 1/M$.

The  problem  of  optimal  storage  of  unchanging  sets  by  means  of  hashing 
techniques  has  very  seldom  been  addressed  (see  [Greniewski  and  Turski  63] 
and  [Sprugnoli  77]).  One  possible  reason  for  this  absence  of  interest  is 
that  normal  hashing  techniques  are  reasonably  good  for  static  sets. 

We  will  nevertheless  investigate  some  basic  techniques  and  discuss 
some  of  the  extra  advantages  resulting  from  the  availability  of  preprocessing. 

Throughout this section and as a means of comparison we will use the
same Example 2.0.1 used in Chapter 2. We will also refer to a universe
$U_s = [0,2^s)$ and a subset $S = \{s_1, s_2, \ldots, s_n\} \subseteq U_s$. Also, given a hash
function $h$, $\mathrm{hash}(h,S) = (h(s_1), h(s_2), \ldots, h(s_n))$ will be the hash
sequence of $S$.

Again we will be interested in measuring the number of bits $B_i(s,n)$
that representation $i$ requires for random sets of size $n$. Also, the
average retrieval times $R_i(s,n)$ (for a successful search) and $R_i'(s,n)$
(unsuccessful search) will be presented. Retrieval time will be measured
in terms of the number of hash function evaluations and/or comparisons
needed.


The  Basic  Representation. 

For fast retrieval and simplicity, nothing beats hashing with collision
resolution by separate chaining (see [Knuth 75, Section 6.4]). The i-th
entry $t_i$ of the hash table $T$ is of the form

$$t_i = \langle \mathrm{tag}_i, v_i, \mathrm{link}_i \rangle$$

where

$\mathrm{tag}_i$ is occupied if $\exists x \in S$: $i = h(x)$, and
is free otherwise;

$v_i$ is an element of the set $S$ (or a special empty value);

$\mathrm{link}_i$ points to the next element in the chain that hashes to
$h(v_i)$ (or it is a special null value).

(Figure 4.1.1 shows the sample set represented in this fashion.)

The size of the table $M$ and the size of the set $|S| = n$ determine
the load factor

$$\alpha = n/M \tag{4.1.1}$$


i. 


(It. 1.2) 


(4.1.5) 


The above retrieval times could be improved by sorting the collision chains
but the improvement is minimal. (Notice that even with a full table,
$\alpha = 1$, retrieval requires fewer than 1.5 probes on the average.)


A More  Economical  and  Flexible  Representation. 

The fact that an element hashes to a given address provides information
about that element that can be used by the unhash function, thus requiring
less information to be stored in the $v$-fields of the table. Roughly
speaking, the hashing address provides us with $\lg M$ bits of information,
so only about $(s - \lg M)$ bits need to be stored. (This idea is
due to Butler Lampson, see [Knuth 75, p. 518].)

For instance, if the hash function is $h(x) = x \bmod M$, then we only
need to store $\lfloor x/M \rfloor$, since $x = M \lfloor x/M \rfloor + x \bmod M$, and $\lfloor x/M \rfloor$ requires
only $(s - \lg M)$ bits.

There is also a useful trade-off between space and retrieval time:
by separating the hash table and the value fields and storing the values
sorted by hash address we can vary the size of the hash table at will,
changing the retrieval times correspondingly.

Two tables are needed: the hash table proper, $T$, with entries

$$t_i = \langle \mathrm{link}_i \rangle, \quad 0 \le i \le M,$$

where $\mathrm{link}_i$ points to the value table (if the entry is free,
$\mathrm{link}_i$ points to the first value of the chain
corresponding to the next occupied entry, thus
$\mathrm{link}_i = \mathrm{link}_{i+1}$),

and a value table, $V$, $|V| = n$,

$$v_k = \langle r_k \rangle,$$

where $r_k$ is such that if $h(x) = i$ and $\mathrm{link}_i \le k < \mathrm{link}_{i+1}$,
then $h^{-1}(i, r_k) = x$
(see Figure 4.1.2 for an example).

The above representation makes unhashing of a single element a rather
difficult task (since given an entry $k$ in $V$, the corresponding hash
address $i$ must be figured out by searching $T$ for $\mathrm{link}_i \le k < \mathrm{link}_{i+1}$),
but the intersection algorithm requires unhashing of whole sets only, and
that is easily accomplished in $O(n)$ time.

The  ordering  of  the  v fields,  a simple  preprocessing  task,  is 
important  for  two  reasons:  first,  it  eliminates  the  need  for  pointers  in 
the  V table  and,  second,  it  cuts  the  unsuccessful  search  time  by  about 
half,  since  the  search  in  the  collision  table  is  now  a search  on  an  ordered 
list. 
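The two-table representation can be sketched as follows for the hash function $h(x) = x \bmod M$, so that the stored value is the quotient $\lfloor x/M \rfloor$. The function names are illustrative, and the binary search of the ordered chain is one possible choice made here for brevity, not something the text prescribes.

```python
from bisect import bisect_left


def build(elements, M):
    """Build the link table T (M + 1 entries) and the value table V of quotients x // M."""
    buckets = [[] for _ in range(M)]
    for x in sorted(elements):
        buckets[x % M].append(x // M)   # sorted input keeps every chain ordered
    T, V = [], []
    for chain in buckets:
        T.append(len(V))                # link_i: index in V where chain i starts
        V.extend(chain)
    T.append(len(V))                    # sentinel entry link_M = n
    return T, V


def contains(T, V, M, x):
    """Search for x: one access to T, then a search of the ordered chain in V."""
    i, r = x % M, x // M
    k = bisect_left(V, r, T[i], T[i + 1])
    return k < T[i + 1] and V[k] == r


def unhash_all(T, V, M):
    """Recover the whole set by walking T and V together, in O(n + M) steps."""
    return [M * V[k] + i for i in range(M) for k in range(T[i], T[i + 1])]
```

With the sample set of Figure 4.1.2 and `M = 5`, `build` yields `T = [0, 4, 5, 6, 7, 9]`, that is, $M + 1$ links, and an empty chain automatically gets `T[i] == T[i + 1]`.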

The space requirements are

$$B_2(s,n) = M(\lg n) + n[s - \lg M] = n \Big[ s - (\lg n) \Big( 1 - \frac{1}{\alpha} \Big) + \lg \alpha \Big]. \tag{4.1.4}$$

[Figure 4.1.1: Hashing with separate chains. The table has columns $i$, $\mathrm{tag}_i$, $v_i$, $\mathrm{link}_i$, with $M = n = 9$, $\alpha = 1$, $h(x) = (97x) \bmod 9$, $h^{-1}(i) = v_i$, and $B_1(8,9) = (1+8+4) \times 9 = 117$ bits.]

[Figure 4.1.2: Separate hash and value tables; abbreviated value fields. The chains hold the elements (25, 50, 130, 200), (51), (232), (3), (189, 219), with $M = 5$, $\alpha = 9/5$, $h(x) = x \bmod 5$, $h^{-1}(i,k) = 5 r_k + i$, and $B_2(8,9) = (5+1) \times 3$ (for $T$) $+\ 9 \times 6$ (for $V$) $= 72$ bits.]

The retrieval times count the average number of accesses to both $T$
and $V$, and are

$$R_2(s,n) = 2 + \frac{\alpha}{2} + O(M^{-1})$$

$$R_2'(s,n) = 2 + \frac{\alpha}{2} - \ldots + O(M^{-1}) \tag{4.1.5}$$

(see [Knuth 75], Exercise 6.4-55).

Coping With Worst Cases: Persistent Hashing.

Everybody knows that one of the best ways to punish a computer scientist
is to make her/him meditate about the incredibly horrendous worst case of
hashing, without the protection of probabilities. So it might seem that
little can be done about worst cases, but a recent result in probability
theory together with the availability of preprocessing helps in the case
of static sets.

Hashing fits into the traditional probabilistic model of throwing $n$
balls into $M$ urns (the hash table), each ball falling into any urn with
uniform probability $1/M$. The hash function acts as a randomizing agent
that distributes the $n$ elements of the set more or less uniformly. The
problem is that the maximum number of balls falling into an urn can be
as high as $n$. Here is where a result that follows from research done
by Persi Diaconis and David Freeman [Diaconis and Freeman 78] is useful.

Theorem D. In the occupancy model defined above, let $\alpha = n/M$ and
$\alpha \ll \lg n$. Let $M_\alpha(n)$ be the maximum number of balls falling into
any single urn, then
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lim  Prob[0  < MQ(n)  - m*(n,a)  < 2]  = 
n -*  oo 

where  m*(n,a)  - ^i^lna  ) 

In short, the worst case of hashing is almost always going to be
less than the logarithm of $n$.

But what happens if not? For example, a hash function like $h(x) = x \bmod M$
yields a worst case of $n$ for sets of the form $S = \{a + iM \mid 0 \le i < n\}$.

The answer is: change the hash function. If we have a good set of
hash functions we can simply keep trying until we hit one that is good
enough.

And thus, persistent hashing: given a family $H$ of hash functions
and a set $S$, randomly choose a hash function $h \in H$. Now hash the set
$S$ with $h$ and check for the worst case: if this turns out to be worse
than $\ln n$ then pick another $h' \in H$ at random and repeat the procedure
until the worst case condition is satisfied.
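The procedure can be sketched as follows. The family $H$ used here, $h(x) = ((ax + b) \bmod p) \bmod M$ with a fixed prime $p$, is an assumption chosen for illustration (it is in the spirit of the universal classes mentioned below, not a family the text prescribes); the names `random_hash`, `worst_case`, `persistent_hash` and the cap `max_trials` are likewise hypothetical.

```python
import math
import random
from collections import Counter

PRIME = (1 << 61) - 1  # a Mersenne prime assumed larger than every key


def random_hash(M, rng):
    """Draw h(x) = ((a*x + b) mod PRIME) mod M from the illustrative family H."""
    a = rng.randrange(1, PRIME)
    b = rng.randrange(0, PRIME)
    return lambda x: ((a * x + b) % PRIME) % M


def worst_case(h, S):
    """Length of the longest collision chain when S is hashed with h."""
    return max(Counter(h(x) for x in S).values())


def persistent_hash(S, M, rng, max_trials=100):
    """Keep drawing hash functions until the longest chain is at most ln |S|."""
    limit = max(1.0, math.log(len(S)))
    for trial in range(1, max_trials + 1):
        h = random_hash(M, rng)
        if worst_case(h, S) <= limit:
            return h, trial
    raise RuntimeError("no acceptable hash function found")
```

For a set of the form $\{a + iM\}$, `worst_case(lambda x: x % M, S)` equals $n$, whereas a function accepted by `persistent_hash` has, by construction, no chain longer than $\ln n$.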

Given the above procedure, we could assign the "truly randomizing"
blue ribbon to the family $H$ if, for any set $S$, independently random
hash sequences $\mathrm{hash}(h_1,S), \mathrm{hash}(h_2,S), \ldots$ are produced where the hash
functions $h_1, h_2, \ldots$ are randomly chosen within $H$. Notice that in
this case the expected number of trials (of different hash functions)
would be just 1.

Further research is required to characterize the behavior of
practical hash families with respect to persistent hashing. (J. L. Carter
and Mark N. Wegman [Carter and Wegman 77] developed a related idea:
universal classes of hash functions, for the usual case of changing sets.)

And  finally,  recursive  hashing. 

The previous paragraphs presented a technique (persistent hashing)
to partition a set $S$ into subsets of at most $\ln |S|$ elements. The
recursive application of persistent hashing to each one of the partitions
of $S$ until reaching an acceptable number of collisions is what we will
call recursive hashing. (A particular instance of this technique can be
found in [Sprugnoli 77].)

For instance, Figure 4.1.3 shows the representation of Figure 4.1.2,
with the first collision list also hashed.

Notice  that  the  bounds  for  space  and  average  retrieval  time  are  at 
least  as  good  as  the  second  representation  considered,  but  the  worst  case 
can  be  considerably  improved. 

An  interesting  open  problem  is  to  investigate  the  trade-offs  involved 
in  choosing  load  factors  for  the  hash  tables  at  different  levels. 

4.2  Hashed  Intersection. 

Time  Bounds  for  Multiple  Set  Intersection. 

In  order  to  intersect  m sets,  we  simply  intersect  the  two  smallest 
ones,  then  the  resulting  intersection  with  the  third  smallest,  and  so  on. 

So assume the sets $P_1, P_2, \ldots, P_m$ of sizes $P = (p_1, p_2, \ldots, p_m)$ and that
the sets are ordered by size so that $p_1 \le p_2 \le \cdots \le p_m$. Let also
$Q = P_1 \cap P_2 \cap \cdots \cap P_m$ be the intersection ($|Q| = q$), and $W_i = P_i - Q$
be all the elements of $P_i$ that do not belong in the intersection.

[Figure 4.1.3: An example of recursive hashing. First-level table: $h(x) = x \bmod 5$, $h^{-1}(i,k) = 5 r_k + i$. Second-level table for the first collision list: $h(y) = y \bmod 4$, $h^{-1}(i,k) = (4 r_k + i) \times 5 + 0 = 20 r_k + 5i$.]

According to our discussion in the first chapter, the average
intersection time can be analyzed as

$$I(P,q) = t(P,q) + V(P,q) \tag{4.2.1}$$

where $t(P,q)$: average wasted time (time to verify that
$W_1 \cap W_2 \cap \cdots \cap W_m = \emptyset$)

and $V(P,q)$: time needed to verify that each element of $Q$
belongs to each one of $P_1, P_2, \ldots, P_m$.

In order to evaluate $t(P,q)$ we use the hypothesis that the
subsets $W_i$ are random subsets of $U - Q$, that is:

$$\text{for all } x \in U - Q:\ \mathrm{Prob}[x \in W_i] = \frac{|W_i|}{|U - Q|}, \tag{4.2.2}$$

which means that the expected size of the intersection $W_1 \cap W_2$
will be

$$E(|W_1 \cap W_2|) = \frac{|W_1|}{|U - Q|} \times \frac{|W_2|}{|U - Q|} \times |U - Q|$$

or in general

$$E(|W_1 \cap W_2 \cap \cdots \cap W_k|) = \Big( \prod_{1 \le i \le k} |W_i| \Big) \Big/ |U - Q|^{k-1} \tag{4.2.3}$$

and since

$$\frac{|W_i|}{|U - Q|} \le \frac{|W_i| + |Q|}{|U - Q| + |Q|} = \frac{|P_i|}{|U|} = \frac{p_i}{2^s}$$

we get

$$E(|W_1 \cap W_2 \cap \cdots \cap W_k|) \le 2^s \prod_{1 \le i \le k} (p_i / 2^s) = p_1 \cdot \prod_{2 \le i \le k} (p_i / 2^s). \tag{4.2.4}$$


So now we are ready to evaluate

$$t(P,q) = \sum_{1 \le k < m} E(|W_1 \cap W_2 \cap \cdots \cap W_k|) \cdot R'(s, p_{k+1}); \tag{4.2.5}$$

that is, the k-th step of the intersection requires the time needed to
retrieve each element in $W_1 \cap W_2 \cap \cdots \cap W_k$ from $W_{k+1}$,

$$t(P,q) \le C^* \cdot p_1 \sum_{1 \le k < m} \prod_{2 \le i \le k} (p_i / 2^s) \tag{4.2.6}$$

where $C^*$ = maximum average retrieval time (successful or unsuccessful)
over all sets.

The sum in (4.2.6) is easily bounded by

$$\sum_{1 \le k < m} \prod_{2 \le i \le k} (p_i / 2^s) \le \sum_{1 \le k < m} (p_m / 2^s)^{k-1} < \frac{1}{1 - p_m / 2^s} \tag{4.2.7}$$

yielding

$$t(P,q) \le C^* \cdot p_1 \cdot \frac{1}{1 - p_m / 2^s}. \tag{4.2.8}$$


The time needed to verify the intersection equals the time needed
to intersect $Q$ and each $P_i$, thus

$$V(P,q) \le C \cdot m \cdot q \tag{4.2.9}$$

where $C$ = maximum average successful retrieval time over all sets.

So we obtain

$$I(P,q) \le C^* \cdot p_1 \cdot \frac{1}{1 - p_m / 2^s} + C \cdot m \cdot q. \tag{4.2.10}$$
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Finally,  we  should  notice  that  if  the  sets  to  be  intersected  belong 
to a given family $\mathcal{P}$ of sets (the indexes of a library, for instance)
we can define the global maxima

$C(\mathcal{P})$: maximum successful retrieval time over all
sets of $\mathcal{P}$

$C^*(\mathcal{P})$: idem, both successful and unsuccessful

$$\rho = \max_{P \in \mathcal{P}} \, [\, |P| / 2^s \,] \tag{4.2.11}$$

so that

$$I(P,q) \le C^*(\mathcal{P}) \, p_1 \frac{1}{1 - \rho} + C(\mathcal{P}) \, m \, q \tag{4.2.12}$$

that is: the intersection time is linear in

• the size of the smallest set to be intersected;

• the product of the number of sets and the size of their intersection.
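The multiple-set procedure described at the start of this section (intersect the two smallest sets, then the result with the third smallest, and so on) can be sketched as below. Python's `set` again stands in for the stored hashed representations, and the function name and probe counter are illustrative assumptions.

```python
def intersect_all(sets):
    """Intersect m sets smallest-first; return the result and the number of probes made."""
    ordered = sorted(sets, key=len)        # p_1 <= p_2 <= ... <= p_m
    current = list(ordered[0])
    probes = 0
    for other in ordered[1:]:
        # Reuse an existing hashed set; otherwise build a stand-in table.
        table = other if isinstance(other, (set, frozenset)) else set(other)
        probes += len(current)             # one retrieval per surviving element
        current = [x for x in current if x in table]
    return current, probes
```

The probe count starts at $p_1$ and shrinks at each step as elements outside the intersection are weeded out, while every element of the final intersection is probed once per remaining set, in line with the two linear terms of (4.2.12).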


The  Working  Set  Size. 

In  most  practical  cases  the  sets  to  be  intersected  will  be  stored  in 
a hierarchical  memory  system  (some  kind  of  paging  scheme,  a conventional 
file  structure,  etc. ) and  it  is  crucial  to  minimize  the  total  number  of 
transfers  between  memory  hierarchies  (the  overall  working  set  size). 

In  the  case  of  trie  intersection,  the  algorithm  runs  through  the 
tries  in  preorder.  By  sequentially  storing  the  trie  nodes  in  that  order, 
a simple  preprocessing  task,  the  working  set  is  minimized  since  once  a 
given  block  of  information  (a  page  or  a physical  block  in  a file)  is  used, 
it  will  not  be  required  again.  But  hash  tables  are  accessed  in  a basically 
random  fashion  so  that  the  same  block  of  information  might  be  required  at 
different  times  when  computing  the  intersection. 
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If  no  preventive  measures  are  taken,  the  number  of  transfers  is  going 
to be exactly the quantity $I(P,q)$ evaluated before, Equation (4.2.10), with
the constants $C^*$ and $C$ taking the value of the average number of
transfers required for a retrieval operation.

But suppose we want to intersect the sets $P_1$ and $P_2$ (stored
using hash functions $h_1$ and $h_2$): if we managed to look for the
elements of $P_1$ in the same order that they are (or would be) sequentially
stored in $P_2$, then we would have minimized the number of transfers
involved. This idea can be utilized in two ways:

(i) Sort $P_1$ using the set of values $h_2(P_1)$ as sorting keys. Now,
$P_2$ may have been preprocessed so that an orderly (according to $h_2$)
traversal minimizes the number of transfers (for instance, the
second representation considered in the previous section does it).
But a solution like this introduces a sorting step with the
logarithmic factor attached to it.

(ii) Make $h_1$ and $h_2$ define compatible orderings, in the sense that
$h_1(x) < h_1(y)$ iff $h_2(x) < h_2(y)$. (For instance, define a unique
global scrambling function $h^*: U \to [0,1)$, and if $h_i: U \to [0,M_i)$
then $h_i(x) = \lfloor h^*(x) \cdot M_i \rfloor$.) Such a strategy eliminates the possibility
of choosing a better hash function by "persistent hashing", though
we still may use a scheme like "recursive hashing" with the $h_i$'s as
first level hash functions for all the sets.

There is yet another way of dealing with the working set problem:
reduce the size of the data structure used to compute the intersection.
A technique to do so is analyzed in the next subsection.
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The  Use  of  Hashed  Maps. 

A hash function $h: U \to [0,M)$ induces a partition of the universe
$U$ into $M$ equivalence classes. Given a set $S$, $h(S)$ will be the
h-map associated with it. (Normally $h(S)$ will be represented as a
bit string of $M$ bits.)

The important point about hashed maps is that if a given element $x \in U$
hashes into $h(S)$, that is $h(x) \in h(S)$, then the probability of $x$
belonging to $S$ is larger than the usual probability $|S| / |U|$.

This last fact was originally exploited by Burton H. Bloom [Bloom 70]
in a method for reducing the number of accesses to secondary
storage when dealing with very large hash tables. (Current research by
Robert W. Floyd [Floyd 77] also deals with hashed maps.)

The intersection of two sets $P_1$ and $P_2$ may now be computed as
follows:

(i) Hash the elements of $P_1$ using $h_2$. Let $P_1'$ be the
subset of $P_1$ that hashes into $h_2(P_2)$.

(ii) Perform the usual intersection between $P_1'$ and $P_2$. (Notice that
the subset $P_1 - P_1'$ that does not hash into $h_2(P_2)$ does not
intersect $P_2$.)

If $\gamma_2$ is the probability of falsely accepting $x$, that is

$$\gamma_2 = \mathrm{Prob}[h_2(x) \in h_2(P_2) \text{ and } x \notin P_2]$$

then the first step will weed out a fraction $(1 - \gamma_2)$ of the elements
of $P_1$ that do not belong in the intersection, thus reducing the work
in Step (ii).
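Steps (i) and (ii) can be sketched with a single hash function and a byte array standing in for the bit string of $M$ bits. The names `make_map` and `filtered_intersection`, and the use of a Python `set` for the "usual intersection", are assumptions made for illustration.

```python
def make_map(S, M, h):
    """h-map of S: an array of M flags with flag h(x) set for every x in S."""
    bits = bytearray(M)
    for x in S:
        bits[h(x)] = 1
    return bits


def filtered_intersection(P1, P2, M, h):
    """Step (i): keep the elements of P1 that hash into h(P2). Step (ii): usual intersection."""
    hmap = make_map(P2, M, h)
    survivors = [x for x in P1 if hmap[h(x)]]   # elements of P1 accepted by the map
    table = set(P2)                             # stand-in for the hashed set P2
    return [x for x in survivors if x in table], len(survivors)
```

Only the survivors reach the full hashed representation of $P_2$; the difference between the number of survivors and the size of the result is the number of false acceptances.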
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So  let  us  now  take  a closer  look  at  hashed  maps.  Given  a set  S 


we will associate with it a "composite hash function" $h$ consisting
of $d$ functions $h_1, h_2, \ldots, h_d$ hashing into $\mathcal{M} = [0,M)$.

An element $x$ hashes into a subset of $\mathcal{M}$, the "hash set" of $x$:

$$h(x) = \{ h_i(x) \mid 1 \le i \le d \} \tag{4.2.13}$$

and the hashed map of $S$ is

$$h(S) = \bigcup_{x \in S} h(x). \tag{4.2.14}$$

(Notice that for an element to belong in $S$, its hash set must be
included in $h(S)$:

$$\forall y \in U:\ h(y) \not\subseteq h(S) \Rightarrow y \notin S.)$$

We estimate the probability $\gamma$ of falsely accepting an element $x$.
Let $\Phi$ be the probability of an element of $\mathcal{M}$ not to belong in $h(S)$;
then

$$\Phi = (1 - d/M)^n \approx e^{-dn/M}. \tag{4.2.15}$$

The probability of a given element $x$ being accepted is the probability
that all of $h_i(x)$ belong in $h(S)$, thus

$$\gamma = (1 - \Phi)^d - |S| / |U|. \tag{4.2.16}$$

Now, the probability of $x$ belonging to $S$, $|S| / |U|$, may be assumed
for all practical purposes to be $|S| / |U| \approx 0$, so

$$\gamma \approx (1 - e^{-dn/M})^d = (1 - e^{-d\alpha})^d \tag{4.2.17}$$

where $\alpha = n/M$ is the load factor of the hash map.


For a given load factor $\alpha_0$, there is a value of $d_0$ that minimizes
$\gamma$ to an optimum $\gamma_0$.

For instance, with three hash functions, $d_0 = 3$, and using about 4.3
bits per element of $S$, $\alpha_0 = 0.23$, we get $\gamma_0 = 1/8$. If, on the other
hand, only one hash evaluation is used, the probability increases to
about 0.21 (still using the same load factor, $\alpha = 0.23$).
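These figures can be checked against the approximation $\gamma \approx (1 - e^{-d\alpha})^d$. The sketch below also uses the load factor $\alpha = \ln 2 / d$, which is the standard optimum for maps of this kind and is stated here as an assumption rather than taken from the text; the function name is illustrative.

```python
from math import exp, log


def false_accept(d, alpha):
    """Approximate false-acceptance probability (1 - e^(-d * alpha))^d."""
    return (1.0 - exp(-d * alpha)) ** d


if __name__ == "__main__":
    alpha0 = log(2) / 3             # assumed optimal load factor for d = 3, about 0.231
    print(1 / alpha0)               # bits of map per set element
    print(false_accept(3, alpha0))  # three hash functions
    print(false_accept(1, 0.23))    # a single hash function at the same load
```

At $\alpha = \ln 2 / d$ exactly half of the map positions are expected to be occupied, so the expression reduces to $2^{-d}$.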

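In order to estimate the time $I(P,q)$ needed to intersect $m$ sets
of sizes $P = (p_1, p_2, \ldots, p_m)$, we may use exactly the same analysis that
yielded $I(P,q)$ in the first part of this section, obtaining

$$\begin{aligned}
I(P,q) &\le C^{h}(\mathcal{P}) \, p_1 \frac{1}{1 - \rho} + C^*(\mathcal{P}) \, p_1 \frac{\gamma}{1 - \rho} + C(\mathcal{P}) \, m q \\
&= (C^{h}(\mathcal{P}) + \gamma \, C^*(\mathcal{P})) \, p_1 \frac{1}{1 - \rho} + C(\mathcal{P}) \, m q
\end{aligned} \tag{4.2.19}$$

where $C(\mathcal{P})$ and $C^*(\mathcal{P})$ are maximum average number of retrieval
tries as in (4.2.11)

and $C^{h}(\mathcal{P})$ is the average time required to check whether $h(x)$
belongs in a hashed map.

(The term $C^*(\mathcal{P}) \, p_1 \, \gamma / (1 - \rho)$ is a bound on the time needed to reject the
elements falsely accepted by the hashed maps.)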

Two  important  characteristics  of  the  use.  of  hashed  maps  may  be 
singled  out: 

First,  by  checking  the  hashed  maps  first,  it  is  possible  to  obtain 
a good  estimate  of  the  size  of  the  intersection  (the  estimate  will  have 

IT1“1 

an  error  of  at  most  p^  • 7 ).  This  could  be  an  important  advantage 


r 

in  interactive  query  systems  (to  produce  answers  like:  "Your  request 
covers  at  most  38  papers.  Do  you  want  them  listed?"). 

Second, the hashed maps are small compared with the full set
($1/\alpha$ bits versus roughly s bits per set element), so if the right
kind of hash function is chosen for the hashed maps, a sensible reduction
in the number of transfers from secondary storage may be obtained. (One
possibility: a unique global hash function for all the hashed maps, as
discussed in the previous subsection.)
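
The first characteristic can be illustrated with a small sketch. It is not the thesis's implementation: the class `HashedMap`, the toy hash family inside `_positions`, and the helper `estimate_intersection` are all hypothetical names chosen for illustration. The only property relied upon is that a hashed map never rejects a true member, so the count below can only overestimate the intersection.

```python
class HashedMap:
    """Bit map of `bits` bits filled through d hash functions (illustrative)."""

    PRIME = 2_147_483_647  # 2**31 - 1, used by the toy hash family below

    def __init__(self, elements, bits, d=3):
        self.bits = bits
        self.d = d
        self.map = [False] * bits
        for x in elements:
            for pos in self._positions(x):
                self.map[pos] = True

    def _positions(self, x):
        # Hypothetical global hash family, shared by every map:
        # h_i(x) = ((a_i * x + b_i) mod PRIME) mod bits.
        return [((2 * i + 3) * x + 7 * i + 1) % self.PRIME % self.bits
                for i in range(self.d)]

    def may_contain(self, x):
        # True for every member; occasionally true for a non-member.
        return all(self.map[pos] for pos in self._positions(x))


def estimate_intersection(smallest, maps):
    """Count the elements of the smallest set accepted by every hashed map.

    The result is an upper bound on the size of the true intersection."""
    return sum(1 for x in smallest if all(m.may_contain(x) for m in maps))
```

A query system could call `estimate_intersection` before touching the full sets and report the count as "at most this many" results.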


The  Worst  Case,  Revisited. 


We have already discussed the worst case of retrieval time for
individual sets. Now it is time to look at the worst case when computing
the intersection of our beloved m sets $P_1, P_2, \ldots, P_m$ (as usual, sizes
are $P = (p_1, p_2, \ldots, p_m)$).

Having a bound on the worst case for retrieval yields a worst case
for the intersection

    $W(P) = \sum_{1 < k \le m} p_1 \ln p_k \le p_1 (m-1) \ln p_m$    (4.2.20)

that is still rather terrible.

One solution consists in ordering the traversal through the different
sets by adopting a global hash function, and storing the collision lists in
order. Assume now we want to intersect $P_1$ and $P_2$. The algorithm
proceeds as usual except that a "last visited element" pointer is kept
up-to-date. Whenever a new element of $P_1$ has to be searched for, the
last visited element is checked: if it belongs in the same collision list
as the element being searched for, the search starts with the "next-to-last
visited element". This scheme (extended to simultaneous intersections
of m sets) guarantees that

    $W(P) \le \sum_{1 \le k \le m} p_k$    (4.2.21)

since no element in the sets will ever be visited twice.

And (4.2.21) is not such a bad worst case (similar to the worst case
for sorted sequential representation), so we close this chapter happily.
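
A minimal sketch of the ordered scheme follows, under simplifying assumptions that are ours: each set is flattened into a single list sorted by (bucket, key), which stands in for "collision lists stored in order", and `global_hash` is a hypothetical global hash function. The per-set cursor plays the role of the "last visited element" pointer and never moves backwards.

```python
def global_hash(x, buckets=64):
    # Hypothetical global hash function shared by every set.
    return (x * 2654435761) % (2 ** 32) % buckets


def ordered(elements):
    """Store a set as one list sorted by (bucket, key)."""
    return sorted(elements, key=lambda x: (global_hash(x), x))


def intersect_ordered(sets):
    """Intersect sets built with ordered(); returns (result, elements visited)."""
    sets = sorted(sets, key=len)       # drive the scan from the smallest set
    cursors = [0] * len(sets)          # one "last visited" pointer per set
    visited = 0
    result = []
    for x in sets[0]:
        visited += 1
        key = (global_hash(x), x)
        found = True
        for i in range(1, len(sets)):
            other = sets[i]
            # Resume from the last visited position and skip smaller keys.
            while cursors[i] < len(other) and \
                    (global_hash(other[cursors[i]]), other[cursors[i]]) < key:
                cursors[i] += 1
                visited += 1
            if cursors[i] == len(other) or other[cursors[i]] != x:
                found = False
                break
        if found:
            result.append(x)
    return result, visited
```

Because every cursor only advances, the visit counter cannot exceed the sum of the set sizes, which is the bound (4.2.21).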


Conclusions. 


About  what  has  been  presented. 

This  work  presented  two  different  methods  to  store  and  operate  on 
sets,  one  based  on  tries  and  the  other  on  hashing  techniques. 

In  both  trie  and  hashing  based  representations,  the  storage  requirements 
were  very  good;  less  than  what  would  be  needed  by  sequentially  allocating 
the  same  set  and  very  close  to  the  theoretically  minimum  requirements. 

Sets were considered static and a certain amount of preprocessing
time was assumed to be available. Trie building reduced mainly to a
sorting process. In the case of hashed representation, some kind of sorting
was used but, also, preprocessing time could be used in choosing a convenient
hash function from a pre-existing family.

In  general  the  representations  here  presented  have  a very  good 
average  retrieval  time  (logarithmic  for  the  trie  structures  and  constant 
for  the  hashed  representations).  The  worst  case  is  the  logarithm  of  the 
size  of  the  universe  for  the  trie  structures,  and  by  means  of  preprocessing, 
may  be  reduced  to  logarithmic  in  the  hashed  case. 

Very  few  of  the  representations  allowed  for  easy  update  procedures. 

On  the  assumption  of  sets  being  static,  updates  were  supposed  to  be 
processed  as  a recreation  of  the  whole  set. 

And  finally  intersection.  Trie  intersection  turned  out  to  have  good 
average  running  times  when  the  sizes  of  the  sets  to  be  intersected  are 
different.  On  the  other  hand,  the  algorithm  behaves  like  the  usual 
ordered  list  intersection  for  sets  of  similar  size.  The  algorithm  proceeds 
by  an  orderly  traversal  through  the  data  structure,  and  this  fact  minimizes 
the overall working set (provided the tries have also been stored in an
orderly fashion).


Hashed  intersection  proved  to  be  the  fastest  gun  in  the  West;  its 
wasted  effort  is  truly  minimal  (bounded  by  a constant  times  the  size  of 
the  smallest  set  to  be  intersected,  and  independent  of  the  number  of  sets 
being  intersected).  But  some  problems  had  to  be  faced;  a random  access 
pattern  to  the  data  structure  that  could  pessimize  the  working  set,  and 
the  possibility  of  a horrendous  worst  case.  Techniques  were  presented 
to  get  around  these  latter  problems,  though  it  is  not  clear  how  much  the 
solution  to  the  problems  could  affect  the  excellent  run  time. 

About  what  has  not  been  presented. 

Perhaps  the  greatest  absentee  in  this  work  has  been  the  discussion 
of  set  union  algorithms,  and  we  will  briefly  cover  them  here. 

Both representations permit the implementation of efficient union
algorithms. It must be remarked that by union we mean "destructive"
union, where one of the sets is replaced by the union of itself and other
sets. (If the result is going to be generated anew, then there is not
much to be gained over the ordered list union.)

In the case of trie representation, assume we wish to compute the
union of two sets $P_1$ and $P_2$, and store the result in $P_1$. The
algorithm proceeds by traversing the nodes common to both tries. When
reaching a subtrie in $P_2$ that is empty in $P_1$, the algorithm simply
appends it to $P_1$ and proceeds. This procedure has exactly the same
complexity as the intersection (since only common nodes are visited).

So on the average the time needed to compute the union of two sets of
sizes $p_1$ and $p_2$ will be given by

    $T_{\mathrm{Union}}(p_1, p_2) = p_1 \lg\left(1 + \frac{p_2}{p_1}\right) + p_2 \lg\left(1 + \frac{p_1}{p_2}\right)$ .    (5.1)

Notice that the union modifies the data structure, so none of the
preordered representations could be used. (The reader is referred
to [Brown and Tarjan 77] for a more complete discussion of the union
operation and another implementation of it.)
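
The destructive trie union can be sketched as follows. This is an illustration under our own assumptions, not the representation of Chapter 2: the trie is a plain, uncompressed binary trie stored as nested two-element lists, `None` denotes an empty subtrie and `True` a key at the bottom level; the names `insert`, `union_into` and `members` are hypothetical.

```python
def insert(node, key, bits):
    """Insert a bits-wide integer key into the trie rooted at node."""
    if bits == 0:
        return True
    if node is None:
        node = [None, None]
    bit = (key >> (bits - 1)) & 1
    node[bit] = insert(node[bit], key, bits - 1)
    return node


def union_into(p1, p2, bits):
    """Destructively merge trie p2 into trie p1; returns the new root of p1."""
    if p2 is None:
        return p1
    if p1 is None:
        return p2            # graft the whole subtrie without traversing it
    if bits == 0:
        return True
    for bit in (0, 1):       # only nodes common to both tries are visited
        p1[bit] = union_into(p1[bit], p2[bit], bits - 1)
    return p1


def members(node, bits, prefix=0):
    """List the keys stored in the trie, in increasing order."""
    if node is None:
        return []
    if bits == 0:
        return [prefix]
    return (members(node[0], bits - 1, prefix << 1)
            + members(node[1], bits - 1, (prefix << 1) | 1))
```

Grafting shares the appended subtries between the two tries, so in this sketch the second trie should be regarded as consumed by the operation.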

In  the  case  of  hashed  representations,  a union  algorithm  is  also 
possible  provided  some  sort  of  overflow  handling  representation  is 
adopted.  In  this  case  the  average  time  required  is  simply 

Union(p1,p2)  = c • min(p1,p2)  (5.2) 

but  it  is  not  clear  whether  the  above  statement  is  fair  since  the  outcome 
of  the  union  is  a "deteriorated"  set  (with  a larger  load  factor  and, 
perhaps,  a worse  worst-case  retrieval  time) . In  any  case,  some  way  of 
revamping  such  sets  may  be  needed  for  multiple  set  intersection. 
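
A hedged sketch of the hashed union: the smaller set is inserted, element by element, into the larger one, with chaining standing in for "some sort of overflow handling". The class `ChainedSet` and the function `union_destructive` are illustrative names of ours, and Python's built-in `hash` replaces the hash functions discussed in the text.

```python
class ChainedSet:
    """Hash table with chained overflow lists (illustrative only)."""

    def __init__(self, elements, buckets):
        self.table = [[] for _ in range(buckets)]
        self.size = 0
        for x in elements:
            self.add(x)

    def add(self, x):
        chain = self.table[hash(x) % len(self.table)]
        if x not in chain:
            chain.append(x)
            self.size += 1

    def __contains__(self, x):
        return x in self.table[hash(x) % len(self.table)]

    def __iter__(self):
        for chain in self.table:
            yield from chain

    def load_factor(self):
        return self.size / len(self.table)


def union_destructive(a, b):
    """Insert the smaller set into the larger one and return the larger."""
    small, large = (a, b) if a.size <= b.size else (b, a)
    for x in small:          # min(p1, p2) insertions
        large.add(x)
    return large
```

The returned set keeps its original number of buckets, so its load factor grows with every union; this is the "deterioration" mentioned above.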


And  now  ...  for  something  completely  different;  practical  considerations. 

This  work  presented  a set  of  techniques  and  discussed  them,  but 
before  using  them  in  the  real  world  many  choices  have  to  be  made.  Let 
us  assume  for  a while  that  we  wish  to  implement  a data  base  by  means  of 
inverted  files. 

Choosing a representation and tuning it to the computer system
characteristics is the first task. Important factors for this decision
are the machine addressing and bit handling facilities (do I pack nodes
or do I respect byte boundaries?) and the memory hierarchy (how big a
bucket? What about the expected working set size?)


A second major set of decisions is performance related. Space-time
trade-offs (what load factor in a hash representation?). Or what price
for updating? (batch the updates or leave some slack in the data
structures and reorganize the data base every so often?) Or decide
about extras (like hashed maps). Or real-time requirements (use some
kind of persistent or recursive hashing in order to improve the
worst-case retrieval?)

A final choice consists in the collection of statistical data to
verify and improve the assumptions made: How big are real-life intersections
compared to the sets they come from? What does a query look like?

Enough.  Let  us  go  back  to  Academia. 

Further  Research. 

Some  of  the  problems  here  presented  deserve  further  analysis. 

• Characterize randomizing families of hash functions and further
investigate persistent hashing.

• Investigate recursive hashing. What space-time trade-offs are
worthwhile? How can the worst case be improved (assuming preprocessing)?

• Given a data base organized by means of inverted files, what is the
best way to process general queries (including intersections and
unions)? What kind of "planning aids" (like hashed maps) should
be stored in the data base to help decide how to process a given
query?


Appendix  A. 


This appendix investigates the asymptotic average behavior of
binary tries.

Section A.1 discusses the number of nodes $A_2(s,n)$ of a trie
representing a random set of n elements from a universe $[0, 2^s)$.
Section A.2 discusses $A_2(s,n)$ further, in particular the fact that

    $A_2(s+1, n) \ge A_2(s, n)$ ,

and its limiting case when s grows indefinitely.

The rest of the appendix uses the universe U = [0,1),
and the average set within it. Section A.3 presents the basic counting
equation. Section A.4 counts nodes with two nonterminal siblings
(two-way branching). Section A.5 computes a bound for the space needed to
store pointers ($P_2(s,n)$) in representation 5 (of Chapter 2). And
finally, Section A.6 discusses the growth of $A_2(s,n)$ for increasing
values of s.

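A.1 Counting Nodes: $A_2(s,n)$.

Section 2.1 defines the quantity $A_2(s,n)$ as the average number of
nodes of a trie representing a random set S defined by

    $A_2(s,n) = (1 - \delta_{n0} - \delta_{n1})(1 - \delta_{s0}) + \sum_p \frac{\binom{2^{s-1}}{p} \binom{2^{s-1}}{n-p}}{\binom{2^s}{n}} [A_2(s-1,p) + A_2(s-1,n-p)]$    (A.1.2)

(since $\binom{2^{s-1}}{p} \binom{2^{s-1}}{n-p} / \binom{2^s}{n}$ is the probability of a partition (p, n-p)).

Defining

    $B(s,n) = \binom{2^s}{n} A_2(s,n)$ ,    (A.1.3)

and given the symmetry of (A.1.2),

    $B(s,n) = \binom{2^s}{n} (1 - \delta_{n0} - \delta_{n1})(1 - \delta_{s0}) + 2 \sum_{p \ge 0} \binom{2^{s-1}}{n-p} B(s-1,p)$ .    (A.1.4)

The generating function

    $B_s(z) \equiv B_s = \sum_n B(s,n) z^n = \sum_{n \ge 2} \binom{2^s}{n} z^n + 2 \sum_n \left[ \sum_{p \ge 2} \binom{2^{s-1}}{n-p} B(s-1,p) \right] z^n$ ,
    $B_0(z) \equiv B_0 = 0$    (A.1.5)

can be expressed in terms of

    $G_s = \sum_{n \ge 2} \binom{2^s}{n} z^n = (1+z)^{2^s} - 1 - 2^s z$    (A.1.6)
    $F_s = \sum_n \binom{2^s}{n} z^n = (1+z)^{2^s}$

and iterated

    $B_s = G_s + 2 F_{s-1} B_{s-1}$ ,

yielding

    $B_s = G_s + 2 F_{s-1} G_{s-1} + 2^2 F_{s-1} F_{s-2} B_{s-2}$
    $= \sum_{0 \le j < s} 2^j F_{s-1} F_{s-2} \cdots F_{s-j} G_{s-j}$
    $= \sum_{0 \le j < s} 2^j (1+z)^{2^s - 2^{s-j}} [(1+z)^{2^{s-j}} - 1 - 2^{s-j} z]$

and thus

    $A_2(s,n) = 2^s - 1 - \frac{1}{\binom{2^s}{n}} \sum_{0 \le j < s} \left[ 2^j \binom{2^s - 2^{s-j}}{n} + 2^s \binom{2^s - 2^{s-j}}{n-1} \right]$ .    (A.1.7)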
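
The recurrence and its closed form can be compared on small cases. The sketch below is only a check under our reading of the two formulas: `nodes_recurrence` counts one node whenever at least two elements remain and at least one bit is left, then averages over the partitions (p, n-p); `nodes_closed_form` evaluates $2^s - 1 - \sum_{0 \le j < s} [2^j \binom{2^s - 2^{s-j}}{n} + 2^s \binom{2^s - 2^{s-j}}{n-1}] / \binom{2^s}{n}$. Both function names are ours.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def nodes_recurrence(s, n):
    """Average node count from the recurrence, as an exact fraction.

    Assumes 0 <= n <= 2**s."""
    if s == 0 or n < 2:
        return Fraction(0)
    half = 2 ** (s - 1)
    total = Fraction(0)
    for p in range(n + 1):
        ways = comb(half, p) * comb(half, n - p)   # partitions (p, n - p)
        if ways:
            total += ways * (nodes_recurrence(s - 1, p)
                             + nodes_recurrence(s - 1, n - p))
    return 1 + total / comb(2 ** s, n)


def nodes_closed_form(s, n):
    """Average node count from the closed form, assuming 0 <= n <= 2**s."""
    size = 2 ** s
    acc = 0
    for j in range(s):
        rest = size - 2 ** (s - j)
        acc += 2 ** j * comb(rest, n)
        if n >= 1:
            acc += size * comb(rest, n - 1)
    return size - 1 - Fraction(acc, comb(size, n))


print(nodes_recurrence(2, 2), nodes_closed_form(2, 2))
```

Exact rational arithmetic is used so that the two expressions can be compared for equality rather than to within rounding error.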


A.2 Bounds and Asymptotic Estimates for $A_2(s,n)$.

Equation (A.1.7) is of no help in estimating the number of nodes.
So, first we will bound $A_2(s,n)$.

Lemma A.2.1. $A_2(s+1,n) \ge A_2(s,n)$.

Proof. Given a set P in $U_{s+1} = [0, 2^{s+1})$, |P| = n, define its
prefix as the set in $U_s = [0, 2^s)$:

    $\mathrm{pref}(P) = \{ \lfloor x/2 \rfloor \mid x \in P \}$ .    (A.2.1)

Now, the set P can be uniquely generated from pref(P) by

(i) appending both a "0" and a "1" to the (|P| - |pref(P)|) elements
of pref(P) such that

    $x_1 \in \mathrm{pref}(P)$ and $2x_1, 2x_1 + 1 \in P$    (A.2.2)

(ii) appending either a "0" or a "1" to the rest of the elements of
pref(P), according to the corresponding element in P.
If we now look at the trie for P, Figure A.2.1, we can see that it
has the same node structure as pref(P) with the addition (for each of
the elements $x_1$ defined in (A.2.2) above) of a branch that goes
all the way down to the bottom level. The rest of the elements of
pref(P) do not add any new nodes.

So if we partition $U_{s+1}$ into classes according to the size of the
prefixes, with

    p(s+1,n,k) = probability of (|P| - |pref(P)| = k)    (A.2.3)

we can express

    $A_2(s+1,n) = \sum_k p(s+1,n,k) [A_2(s,n-k) + k \, \mathrm{Jump}(s,n-k)]$    (A.2.4)

where Jump(s,n-k) = average distance between a terminal node
and the bottom level s for a trie of (n-k) elements
in $U_s = [0, 2^s)$ (depicted as branch "J" in Figure A.2.1).

In order to use Equation (A.2.4) we need the following:

Claim A.2.2. $A_2(s,n-1) + \mathrm{Jump}(s,n-1) \ge A_2(s,n)$.

Proof. We can generate (n-1) copies of each set P, ($P \subset U_s$, |P| = n)
by taking each one of the elements x in P and adding it to the
complementary set $P - \{x\}$.

Thus

    $A_2(s,n) = A_2(s,n-1) + a(s,n-1)$    (A.2.5)

where a(s,n-1) is the average increment when adding
a new element
and  certainly 
since at most (n-1) of the elements that may be added will
force the addition of a full branch to the bottom level.

Equations (A.2.5) and (A.2.6) yield the claim.

Figure A.2.2. Adding an element to a set of size

A new element like $\alpha$ does not increase the node count at all.

A new element like $\beta$ will produce a branch all the way to the bottom.


Going back to our lemma: given the Claim and the fact that

Equation (A.2.4) becomes

Once we know that the average number of nodes grows with the size of
the universe, we need to know what happens in the limit, when the universe
size goes to infinity.

The probability

n, having

which has the limiting distribution

And that happens to be the same probability distribution of the infinite
universe U = [0,1) presented in the Introduction and extensively covered
in [Knuth 73, Sections 5.2.2 and 6.5].

Hence

Lemma A.2.3. $\lim_{s \to \infty} A_2(s,n) = A(n)$, where A(n) is the average number of
nodes of a trie representing a random set of n real numbers in U = [0,1). □
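
Both lemmas can be observed on tiny universes by brute force. In the sketch below (ours, not part of the thesis) a node is counted for every prefix of fewer than s bits shared by at least two keys, which is our reading of the node count used in this appendix; `average_nodes` averages over all subsets of size n, and `nodes_infinite` evaluates the alternating sum $1 + \sum_{k \ge 2} \binom{n}{k} (-1)^k (k-1)/(2^{k-1} - 1)$ for the infinite universe.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def count_nodes(keys, s):
    """Nodes of the trie for s-bit keys: prefixes shared by >= 2 keys."""
    nodes = 0
    for depth in range(s):
        prefixes = [k >> (s - depth) for k in keys]
        nodes += sum(1 for p in set(prefixes) if prefixes.count(p) >= 2)
    return nodes


def average_nodes(s, n):
    """Exact average node count over all n-subsets of [0, 2**s)."""
    subsets = list(combinations(range(2 ** s), n))
    return Fraction(sum(count_nodes(sub, s) for sub in subsets), len(subsets))


def nodes_infinite(n):
    """Average node count for n >= 2 random reals in [0, 1)."""
    return 1 + sum(Fraction((-1) ** k * comb(n, k) * (k - 1), 2 ** (k - 1) - 1)
                   for k in range(2, n + 1))


for s in (2, 3, 4, 5):
    print(s, average_nodes(s, 3))
print("limit", nodes_infinite(3))
```

The printed averages should not decrease with s and should stay below the limiting value, as the two lemmas state.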

We  move  on  to  study  A(n)  and  related  topics. 


A.3 The Universe of Real Numbers.

This section simply recapitulates some useful results presented in
[Knuth 73, Section 5.2.2].

The family of equations

    $x_n = a_n + 2 \sum_{k \ge 2} \binom{n}{k} 2^{-n} x_k$ ,    $x_0 = x_1 = 0$    (A.3.1)

has solutions

    $x_n = \sum_{k \ge 2} \binom{n}{k} (-1)^k \frac{2^{k-1} \hat{a}_k}{2^{k-1} - 1} = a_n + \sum_{k \ge 2} \binom{n}{k} (-1)^k \frac{\hat{a}_k}{2^{k-1} - 1}$    (A.3.2)

where $\hat{a}_n = \sum_k \binom{n}{k} (-1)^k a_k$ denotes the binomial transform of
the sequence $(a_n)$. Some of the most popular ones are

    $\hat{1} = \delta_{n0}$
    $\hat{n} = -\delta_{n1}$
    $\widehat{\binom{n}{m}} = (-1)^n \delta_{nm}$
    $\widehat{a^n} = (1-a)^n$
    $\widehat{\binom{n}{m} a^n} = \binom{n}{m} (-a)^m (1-a)^{n-m}$    (A.3.3)


The equations above define many properties of tries representing sets
in U = [0,1). In particular the number of nodes

    $A(n) = 1 - \delta_{n0} - \delta_{n1} + 2 \sum_{k \ge 2} \binom{n}{k} 2^{-n} A(k)$    (A.3.4)

has

    $\hat{a}_n = \delta_{n0} - 1 + n$

and thus

    $A(n) = 1 - \sum_{k \ge 2} \binom{n}{k} \frac{(-1)^k}{2^{k-1} - 1} + \sum_{k \ge 2} \binom{n}{k} \frac{(-1)^k k}{2^{k-1} - 1} = 1 - U_n + V_n$ .    (A.3.5)

The functions $U_n$ and $V_n$ presented above have the refreshing
property of possessing asymptotic expansions:

    $U_n = n \lg n + n \left( \frac{\gamma - 1}{\ln 2} - \frac{1}{2} + f_{-1}(n) \right) + O(1)$
    $V_n = n \lg n + n \left( \frac{\gamma}{\ln 2} - \frac{1}{2} - f_0(n) \right) + O(1)$    (A.3.6)

The functions $f_{-1}(n)$ and $f_0(n)$ (defined in the answer to Exercise 5.2.2-46
[Knuth 75]) are periodic functions of lg n, and are extremely small:

    $|f_{-1}(n)| < 2 \times 10^{-7}$ and $|f_0(n)| < 2 \times 10^{-6}$ .    (A.3.7)
Furthermore both functions have "average value" zero in the following
sense: let f be either $f_{-1}$ or $f_0$, then it turns out that
f(n) = f(lg n mod 1) and for random n, lg n mod 1 is uniformly
distributed (see [Knuth 69, Section 4.2.4]).

Since  almost  all  the  quantities  that  we  will  be  dealing  with  have 
asymptotic  terms  of  the  form  n*f(n)  , it  is  very  difficult  to  give 
theoretically  valid  asymptotic  estimates,  though  for  all  practical 
purposes  the  n*f(n)  terms  are  negligible. 

In any case, and for precision's sake, the concerned reader should
regard many of the expressions below (and their counterparts in Chapter 2)
simply as good estimates of the leading term of the corresponding asymptotic
expansions.

Throughout the rest of the appendix we shall use the catchall term
O(small) to denote terms of the form $O(n \cdot f(n))$.

So Equation (A.3.5) becomes

    $A(n) = \frac{n}{\ln 2} + O(\mathrm{small}) + O(1)$ .    (A.3.8)

Given the discussion above we shall assume

    $A(n) = \frac{n}{\ln 2}$ .    (A.3.9)
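
The estimate $A(n) \approx n / \ln 2$ can be compared with values obtained directly from the recurrence $A(n) = 1 + 2^{1-n} \sum_{k \ge 2} \binom{n}{k} A(k)$ for $n \ge 2$, with A(0) = A(1) = 0. The sketch below is a floating-point illustration; the function name `average_nodes` is ours.

```python
import math


def average_nodes(n_max):
    """Return [A(0), ..., A(n_max)] computed from the recurrence."""
    A = [0.0] * (n_max + 1)
    for n in range(2, n_max + 1):
        scale = 2.0 ** (1 - n)
        partial = sum(math.comb(n, k) * A[k] for k in range(2, n))
        # The k = n term of the sum contains A(n) itself, so solve for it.
        A[n] = (1.0 + scale * partial) / (1.0 - scale)
    return A


A = average_nodes(200)
for n in (2, 3, 50, 200):
    print(n, A[n], n / math.log(2))
```

For small n the two columns differ by a bounded amount, which is absorbed by the O(1) term; the ratio A(n)/n approaches 1/ln 2 as n grows.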


A.4 Counting Nodes with Two Nonterminal Siblings.

This quantity, $y_n$, is used to bound $P_2(s,n)$ in Chapter 2
(Equation (2.1.13)).

A node with n descendants has two nonterminal subtries whenever
none of its subtries has size either 0 or 1.

So the quantity $a_n$ in Equation (A.3.1) becomes
The second sum is studied in [Knuth 73, Exercise 6.3-19].

S2  = | K(n,2,2)  = J jjS-  + O(small)  + 0(1)  . 

Finally 


yn  = ! - 21-n  - n 21-n  + | + &nl  + &n0  + n 21"11  + 21_n  - 2 


2 n2  nl  nO 
+ ^ 2^-2  + 0(  small)  + 0(1) 


    $= \frac{1}{4} \frac{n}{\ln 2} + O(\mathrm{small}) + O(1)$    (A.4.6)

and we shall adopt

    $y_n = \frac{1}{4} \frac{n}{\ln 2} = 0.36 n$ .    (A.4.7)


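A.5 Space Requirements for Pointers.

Equation (2.1.19) defines the space $p_n$ needed for pointers under
representation 5. But the value of $a_n = (1 - \delta_{n0} - \delta_{n1})(2+\epsilon) \lg n$ does
not allow for an easy solution. So we will take an easy way out:
approximate $a_n$ as

    $a_n \approx (2+\epsilon)(1 - \delta_{n0} - \delta_{n1}) \left[ \frac{H_n}{\ln 2} - \frac{\gamma}{\ln 2} \right]$    (A.5.1)

so that

    $p_n = (2+\epsilon)(x_n - z_n)$    (A.5.2)

where $z_n$ is trivially

    $z_n = \frac{\gamma}{\ln 2} A(n) = \frac{\gamma}{(\ln 2)^2} n$    (A.5.3)

and

    $x_n = (1 - \delta_{n0} - \delta_{n1}) \frac{H_n}{\ln 2} + 2 \sum_{k \ge 2} \binom{n}{k} 2^{-n} x_k$ .    (A.5.4)

Given that

    $\sum_{k \ge 1} \binom{n}{k} (-1)^k H_k = -\frac{1}{n}$ ,    (A.5.5)

we obtain

    $\hat{a}_n = \frac{1}{\ln 2} \left( n - \frac{1}{n} \right)$ .

And thus

    $(\ln 2) x_n = (1 - \delta_{n0} - \delta_{n1}) H_n + \sum_{k \ge 2} \binom{n}{k} (-1)^k \frac{k - 1/k}{2^{k-1} - 1}$
    $= (1 - \delta_{n0} - \delta_{n1}) H_n + V_n - \xi_n$ ,    (A.5.6)

where

    $\xi_n = \sum_{k \ge 2} \binom{n}{k} (-1)^k \frac{1}{k} \frac{1}{2^{k-1} - 1}$
    $= \sum_{k \ge 2} \binom{n-1}{k} \frac{(-1)^k}{k} \frac{1}{2^{k-1} - 1} + \frac{1}{n} \sum_{k \ge 2} \binom{n}{k} \frac{(-1)^k}{2^{k-1} - 1}$
    $= \xi_{n-1} + \frac{U_n}{n}$ .    (A.5.7)

So

    $\xi_n = \sum_{k \le n} \frac{U_k}{k} = \sum_{k \le n} \lg k + n \left( \frac{\gamma - 1}{\ln 2} - \frac{1}{2} \right) + O(\mathrm{small}) + O(1)$
    $= \lg(n!) + n \left( \frac{\gamma - 1}{\ln 2} - \frac{1}{2} \right) + O(\mathrm{small}) + O(1)$ .    (A.5.8)

Replacing into (A.5.6),

    $(\ln 2) x_n = (1 - \delta_{n0} - \delta_{n1}) H_n + n \lg n + n \left( \frac{\gamma}{\ln 2} - \frac{1}{2} \right)$
    $- n \lg n - \frac{1}{2} \lg n + \frac{n}{\ln 2} - n \left( \frac{\gamma - 1}{\ln 2} - \frac{1}{2} \right) + O(\mathrm{small}) + O(1)$
    $= (1 - \delta_{n0} - \delta_{n1}) H_n + \frac{2n}{\ln 2} + O(\mathrm{small}) + O(\ln n)$ .    (A.5.9)

Yielding

    $x_n = \frac{2}{(\ln 2)^2} n + O(\ln n + \mathrm{small})$ .    (A.5.10)

And finally we will adopt

    $p_n = (2+\epsilon) \left[ \frac{2}{(\ln 2)^2} n - \frac{\gamma}{(\ln 2)^2} n \right] = (2+\epsilon) \cdot 2.96 n$ .    (A.5.11)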

The  above  result  is  interesting:  if  pointers  are  stored  as  offsets, 
and  encoded  according  to  representation  5,  only  a constant  amount  of 
space  (per  element  in  the  set)  is  needed. 
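
As a check on the constant, the sketch below evaluates the coefficient of n under the assumption that the bracket in (A.5.11) equals $(2 - \gamma)/(\ln 2)^2$, with $\gamma$ denoting Euler's constant; the variable names are ours.

```python
import math

EULER_GAMMA = 0.5772156649015329   # Euler's constant

x_coeff = 2 / math.log(2) ** 2             # leading coefficient of x_n
z_coeff = EULER_GAMMA / math.log(2) ** 2   # leading coefficient of z_n
pointer_coeff = x_coeff - z_coeff          # coefficient of n inside p_n

print(pointer_coeff)
```

Under that assumption the printed value agrees with the constant 2.96 quoted above to two decimal places.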

A.6 More About the Relation Between the Finite and Infinite Universe.

We would like now to answer the question: how fast does $A_2(s,n)$
approach A(n)?

Although at first glance, Equations (A.1.7) and (A.3.5) do not appear
to be similar, we can demonstrate that they actually have the same
asymptotic behavior for most practical values of n.

Let us first obtain a lower bound for $A_2(s,n)$, transforming
Equation (A.1.7)

    $A_2(s,n) = \sum_{0 \le j < s} 2^j - \sum_{0 \le j < s} 2^j \frac{\binom{2^s - 2^{s-j}}{n}}{\binom{2^s}{n}} - \sum_{0 \le j < s} 2^s \frac{\binom{2^s - 2^{s-j}}{n-1}}{\binom{2^s}{n}}$ .    (A.6.1)

We can bound the second sum since

    $\frac{\binom{a-b}{n}}{\binom{a}{n}} \le \left( 1 - \frac{b}{a} \right)^n$    (A.6.2)

and thus

    $\sum_{0 \le j < s} 2^j \frac{\binom{2^s - 2^{s-j}}{n}}{\binom{2^s}{n}} \le \sum_{0 \le j < s} 2^j (1 - 2^{-j})^n$ .    (A.6.3)


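For the third sum in (A.6.1) notice that

    $\binom{a}{n} = \frac{a - n + 1}{n} \binom{a}{n-1}$    (A.6.4)

so

    $\sum_{0 \le j < s} 2^s \frac{\binom{2^s - 2^{s-j}}{n-1}}{\binom{2^s}{n}} \le \frac{2^s n}{2^s - n + 1} \sum_{0 \le j < s} (1 - 2^{-j})^{n-1}$
    $= n \left( 1 + \frac{n-1}{2^s - n + 1} \right) \sum_{0 \le j < s} (1 - 2^{-j})^{n-1}$
    $\le \sum_{0 \le j < s} n (1 - 2^{-j})^{n-1} + \frac{n^2 s}{2^s - n}$ .    (A.6.5)

Finally therefore

    $A_2(s,n) \ge \sum_{0 \le j < s} [2^j - 2^j (1 - 2^{-j})^n - n (1 - 2^{-j})^{n-1}] - \frac{n^2 s}{2^s - n}$ .    (A.6.6)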


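Now we will transform A(n) as given by Equation (A.3.5):

    $A(n) = 1 + \sum_{k \ge 2} \binom{n}{k} (-1)^k \frac{k-1}{2^{k-1} - 1}$
    $= 1 + \sum_{0 < j} \sum_{k \ge 2} \binom{n}{k} (-1)^k (k-1) (2^{-j})^{k-1}$
    $= 1 - \sum_{0 < j} \left[ n \sum_{k \ge 1} \binom{n-1}{k} (-2^{-j})^k + 2^j \sum_{k \ge 2} \binom{n}{k} (-2^{-j})^k \right]$
    $= 1 - \sum_{0 < j} \left[ n [(1 - 2^{-j})^{n-1} - 1] + 2^j [(1 - 2^{-j})^n - 1 + n 2^{-j}] \right]$
    $= 1 + \sum_{0 < j} [2^j - 2^j (1 - 2^{-j})^n - n (1 - 2^{-j})^{n-1}]$ .    (A.6.7)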


Comparing Equations (A.6.6) and (A.6.7), and recalling that
$A(n) \ge A_2(s,n)$, it is easy to see that

    $A(n) \ge A_2(s,n) \ge A(n) - \sum_{s \le j} [2^j - 2^j (1 - 2^{-j})^n - n (1 - 2^{-j})^{n-1}] - \frac{n^2 s}{2^s - n}$ .


And  we  can  bound  the  sum  since 


(A. 6. 8) 


Thus 


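    $\sum_{s \le j} [2^j - 2^j (1 - 2^{-j})^n - n (1 - 2^{-j})^{n-1}]$
    $\le \sum_{k \ge 2} \binom{n}{k} k \cdot 2 \cdot (2^{-(s+1)})^{k-1}$
    $= 2n \sum_{k \ge 1} \binom{n-1}{k} [2^{-(s+1)}]^k$
    $\le 2n \sum_{k \ge 1} \frac{[n 2^{-(s+1)}]^k}{k!}$
    $= 2n \left[ \exp\left( \frac{n}{2^{s+1}} \right) - 1 \right] = O\left( \frac{n^2}{2^s} \right)$    (A.6.9)

    $A(n) \ge A_2(s,n) \ge A(n) - O\left( \frac{n^2 s}{2^s} \right)$ .    (A.6.10)

And since $A(n) = n / \ln 2$ we may conclude that

    $A_2(s,n) = A(n) \left( 1 - O\left( \frac{n s}{2^s} \right) \right)$ .    (A.6.11)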
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